Details

Reviewers

chandlerc
aaron.ballman
rnk
chapuni

Commits

rGdd1463823a0c: This is yet another attempt to re-instate r220932 as discussed in D19271.
rGc048b6c4cd10: Revert "Revert 220932.": "Removing the static initializer in ManagedStatic.cpp…
rG14e2bcccfb84: Removing the static initializer in ManagedStatic.cpp by using llvm_call_once to…
rL271558: This is yet another attempt to re-instate r220932 as discussed in
rL220932: Removing the static initializer in ManagedStatic.cpp by using llvm_call_once to…

Summary

This patch adds an llvm_call_once which is a wrapper around std::call_once on platforms where it is available and devoid of bugs. The patch also migrates the ManagedStatic mutex to be allocated using llvm_call_once.

These changes are philosophically equivalent to the changes added in r219638, which were reverted due to a hang on Win32 which was the result of a bug in the Windows implementation of std::call_once.
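For readers unfamiliar with the pattern, here is a minimal sketch of what the summary describes on platforms where std::call_once works: a mutex that is heap-allocated on first use instead of being constructed by a static initializer. The names in namespace demo (LazyMutex, InitCount, getLazyMutex) are hypothetical stand-ins, not the identifiers from the patch, and std::mutex is used in place of sys::Mutex.

```cpp
#include <cassert>
#include <mutex>

namespace demo {

// Hypothetical stand-ins for the ManagedStatic mutex and its initializer.
static std::mutex *LazyMutex = nullptr;
static int InitCount = 0;

static void initializeMutex() {
  LazyMutex = new std::mutex(); // intentionally never freed in this sketch
  ++InitCount;
}

// Returns the lazily allocated mutex. std::call_once guarantees that
// initializeMutex runs at most once, even with concurrent callers.
static std::mutex *getLazyMutex() {
  static std::once_flag Flag; // constexpr constructor, so no dynamic init
  std::call_once(Flag, initializeMutex);
  return LazyMutex;
}

} // namespace demo
```

Because std::once_flag has a constexpr constructor, the flag itself needs no dynamic initialization on a conforming implementation, which is the property the patch relies on.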

Event Timeline

beanz updated this revision to Diff 15285. Oct 22 2014, 4:59 PM

beanz retitled this revision to "Removing the static initializer in ManagedStatic.cpp by using llvm_call_once to initialize the ManagedStatic mutex."

beanz updated this object.

beanz edited the test plan for this revision.

beanz added reviewers: chandlerc, rnk, aaron.ballman, chapuni.

beanz added a subscriber: Unknown Object (MLST).

Fixed code comments based on Aaron's feedback.

Aaron, I understand your point about using the Windows one-time initialization, however I'm really not the best candidate to write Windows specific code. I am much more familiar with basic threading constructs like MemFence and CompareAndSwap than Win32 APIs. Could you provide and test a Windows one-time init solution?

Thanks,
-Chris

In D5922#4, @beanz wrote:

Fixed code comments based on Aaron's feedback.

Aaron, I understand your point about using the Windows one-time initialization, however I'm really not the best candidate to write Windows specific code. I am much more familiar with basic threading constructs like MemFence and CompareAndSwap than Win32 APIs. Could you provide and test a Windows one-time init solution?

The one time initialization APIs were added in Vista, and I don't know if we've dropped support for XP or not yet. Certainly Microsoft has. =/

I suspect that the fence and CAS implementation has subtle issues, but I'm not a C++ memory model expert. David Majnemer has a better understanding, and he said there are issues.

When I pointed him at this, he told me the template interface I recommended was a bad idea. Maybe having llvm::once_flag and llvm::call_once that alias the std:: variants when available is the way to go. WDYT?

I can re-work this to be an alias of call_once when available. I think in the absence of call_once the hand rolled solution should be sufficient. I'm curious what David thinks the issues with the hand rolled implementation are. I'm pretty good WRT lock-free multi-threading, and I don't spot anything. Also, since we're using pretty much that same code in pass initialization, if there are issues we should identify and solve them.
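A rough sketch of the "alias the std:: variants when available" idea follows. The macro DEMO_USE_STD_CALL_ONCE and the namespace demo_once are hypothetical names made up for illustration; the fallback branch is deliberately left as an error because its implementation is what the rest of this thread debates.

```cpp
#include <cassert>

// DEMO_USE_STD_CALL_ONCE is a hypothetical configuration macro for this
// sketch; a real build would derive it from platform checks.
#ifndef DEMO_USE_STD_CALL_ONCE
#define DEMO_USE_STD_CALL_ONCE 1
#endif

#if DEMO_USE_STD_CALL_ONCE
#include <mutex>

namespace demo_once {
// Where the standard facility is usable, simply forward to it.
using once_flag = std::once_flag;

inline void call_once(once_flag &Flag, void (*Fn)()) {
  std::call_once(Flag, Fn);
}
} // namespace demo_once
#else
#error "This sketch only covers the std::call_once case."
#endif

namespace demo_once {
// Small helpers used only to observe how often the callback runs.
inline int &counter() {
  static int C = 0;
  return C;
}
inline void bump() { ++counter(); }
} // namespace demo_once
```

Callers only ever see demo_once::once_flag and demo_once::call_once, so a platform-specific fallback could be swapped in behind the same interface.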

I would instead recommend we do something along the lines of:

#include <atomic>
#include <thread>

enum InitStatus {
  Done = -1,
  Uninitialized = 0,
  Wait = 1
};

void call_once(std::atomic<InitStatus> &Initialized, void (*fptr)(void)) {
  InitStatus Status = Initialized.load(std::memory_order_relaxed);
  for (;;) {
    // Figure out which state we are in.
    if (Status == Done) {
      // We are at Done, no further transitions are possible.
      // This acquire fence pairs with the release fence below.
      std::atomic_thread_fence(std::memory_order_acquire);
      return;
    }
    if (Status == Wait) {
      // We are waiting for a Wait -> Done transition.
      std::this_thread::yield();
      // Reload status.
      Status = Initialized.load(std::memory_order_relaxed);
      continue;
    }
    // Attempt an Uninitialized -> Wait transition.
    // If we succeed, perform initialization.
    // If we do not, Status will be updated with our new state.
    Status = Uninitialized;
    bool ShouldInit = std::atomic_compare_exchange_weak_explicit(
        &Initialized, &Status, Wait, std::memory_order_relaxed,
        std::memory_order_relaxed);
    if (ShouldInit) {
      (*fptr)();
      std::atomic_thread_fence(std::memory_order_release);
      Initialized.store(Done, std::memory_order_relaxed);
    }
  }
}

This approach ends up with no fences in the actual assembly with either GCC or clang when targeting X86-64. It has the additional benefit of not dirtying any cache lines if initialization has occurred long before.
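To make the state machine easier to experiment with, here is a simplified variant together with a small multi-threaded harness. It is not the code posted above: it uses acquire/release orderings on the atomic operations instead of separate fences, and the names demo_atomic, bump, and run_demo are hypothetical. It assumes a conforming C++11 <atomic> and <thread>.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

namespace demo_atomic {

enum InitStatus { Done = -1, Uninitialized = 0, Wait = 1 };

// Simplified variant: acquire loads pair with the release store of Done.
inline void call_once(std::atomic<InitStatus> &Flag, void (*Fn)()) {
  InitStatus Status = Flag.load(std::memory_order_acquire);
  while (Status != Done) {
    // Try the Uninitialized -> Wait transition; the winner runs Fn.
    if (Status == Uninitialized &&
        Flag.compare_exchange_strong(Status, Wait,
                                     std::memory_order_acquire,
                                     std::memory_order_acquire)) {
      Fn();
      Flag.store(Done, std::memory_order_release);
      return;
    }
    // Somebody else is initializing; back off and reload.
    std::this_thread::yield();
    Status = Flag.load(std::memory_order_acquire);
  }
}

static int Counter = 0;
static void bump() { ++Counter; }

// Races NumThreads threads on one flag and reports how often bump ran.
inline int run_demo(unsigned NumThreads) {
  Counter = 0;
  std::atomic<InitStatus> Flag(Uninitialized);
  std::vector<std::thread> Threads;
  for (unsigned I = 0; I < NumThreads; ++I)
    Threads.emplace_back([&Flag] { call_once(Flag, bump); });
  for (std::thread &T : Threads)
    T.join();
  return Counter;
}

} // namespace demo_atomic
```

Once the flag reads Done, the fast path is a single load, so no cache line is written after initialization has completed.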

include/llvm/Support/Threading.h
70	`volatile` doesn't really make sense here.
71	This is very expensive, you are performing a RMW operation even if nothing has changed.
77–82	Let's assume that loading `initialized` will somehow turn into a relaxed load. Why bother with the fence? You could just fence after: while (initialized != 2); sys::MemoryFence(); This will result in far, far fewer fences should this ever be contended.

It sounds like David's complaints are performance related not correctness. Unfortunately his suggested implementation doesn't really cover the needs for this. The hand-rolled solution is basically filling in on Windows where we aren't guaranteed to have <mutex>, <thread>, or <atomic>.

There are definitely more performant ways to implement the hand rolled solution, but it should be safe as implemented. That said, I will make adjustments to the implementation and submit updated patches for review.

include/llvm/Support/Threading.h
70	Why not? Having this be volatile actually alleviates the possibility of the compiler optimizing away loads.
71	Sure. For performance this should only execute if not already initialized.
77–82	The point of making initialized volatile is so that the compiler will emit a load for the while loop. If you're loading with each iteration of the loop is the fence really needed at all? It shouldn't be for correctness.

In D5922#10, @beanz wrote:

It sounds like David's complaints are performance related not correctness.

I'm afraid not. Because your implementation uses operations which are not atomic, it has undefined behavior in the eyes of C++11.

C++11 [intro.multithread]p21:
The execution of a program contains a data race if it contains two conflicting actions in different threads, at least one of which is not atomic, and neither happens before the other. Any such data race results in undefined behavior.
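As an illustration of the quoted paragraph, the sketch below publishes a plain int through an atomic flag. With the release store and acquire load, the write to Payload happens before the read, so there is no data race. If Ready were a plain bool, the accesses to it (and, transitively, to Payload) would be conflicting non-atomic actions with no happens-before edge. The names demo_race, Payload, and Ready are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

namespace demo_race {

static int Payload = 0;                // plain, non-atomic data
static std::atomic<bool> Ready(false); // atomic flag ordering access to Payload

// Writes Payload on one thread and reads it on another, ordered via Ready.
inline int publish_and_read() {
  Payload = 0;
  Ready.store(false);
  std::thread Producer([] {
    Payload = 42;                                 // non-atomic write
    Ready.store(true, std::memory_order_release); // publish
  });
  while (!Ready.load(std::memory_order_acquire))  // pairs with the release
    std::this_thread::yield();
  int Seen = Payload; // ordered after the write by release/acquire
  Producer.join();
  return Seen;
}

} // namespace demo_race
```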

Unfortunately his suggested implementation doesn't really cover the needs for this. The hand-rolled solution is basically filling in on Windows where we aren't guaranteed to have <mutex>, <thread>, or <atomic>.

Is there a platform where we can't use <atomic>? I'm aware of platforms where we cannot use <thread> or <mutex> but we should use <atomic> if we can. My call_once implementation only used <thread> for yield; if we removed it, we wouldn't need anything other than <atomic>.

A few notes.

According to Chandler's related RFC (http://lists.cs.uiuc.edu/pipermail/llvmdev/2014-September/077081.html) we do not have a requirement that all platforms support <atomic>.

In D5922#11, @majnemer wrote:

I'm afraid not. Because your implementation uses operations which are not atomic, it has undefined behavior in the eyes of C++11.

C++11 [intro.multithread]p21:
The execution of a program contains a data race if it contains two conflicting actions in different threads, at least one of which is not atomic, and neither happens before the other. Any such data race results in undefined behavior.

I think you're misconstruing what this means. Your statement would seem to imply that until C++11 nobody ever managed to write thread safe code-- which is absurd. There are certainly undefined behaviors in the implementation I've provided, but they are known, and appropriately handled.

Both the MemFences, and the tmp in the else branch of the original patch can be safely removed, and the code can be re-written into a while loop for performance. An example implementation would be:

enum InitStatus {
  Done = -1,
  Uninitialized = 0,
  Wait = 1
};

void llvm::call_once(once_flag& flag, void (*UserFn)(void)) {
  while (flag != Done) {
    if (flag == Wait) {
      ::Sleep(1);
      continue;
    }

    sys::cas_flag old_val = sys::CompareAndSwap(&flag, Wait, Uninitialized);
    if (old_val == Uninitialized) {
      UserFn();
      sys::MemoryFence();
      flag = Done;
    }
  }
}

I (and some of my colleagues) believe that this is perfectly safe code, and it eliminates unnecessary MemFences (which you raised concerns about). Do you see a problem with this implementation (other than the fact that it doesn't use std::atomic)?

I don't have any particular opposition to using std::atomic, however since we don't have an enforced requirement I'm not sure we can assume we have it. Please keep in mind I think we still support deploying back to Windows XP.

New patches based on feedback.

Updates in these patches:

Broke Windows and Unix code out into separate Windows.inc files.
Windows implementation uses <atomic> implementation provided by majnemer.

These patches are tested on Windows and Unix.

All the atomic stuff looks great to me.

So, bad news, std::atomic<int> has a non-trivial default constructor in the MSVC STL. This means the code void foo() { static std::atomic<int> flag; } uses a non-threadsafe guard variable to initialize flag. Worse, the guard variable is a bitfield, so it is very much not threadsafe. See the code for foo here:

  mov         eax,dword ptr [?$S1@?1??foo@@YAXXZ@4IA]            
  test        al,1                                               
  jne         initialized
  or          eax,1                                              
  mov         ecx,offset ?flag@?1??foo@@YAXXZ@4U?$atomic@H@std@@A
  mov         dword ptr [?$S1@?1??foo@@YAXXZ@4IA],eax            
  xor         eax,eax                                            
  xchg        eax,dword ptr [ecx]                                
initialized:
  ret

So... I'm running out of ideas. :( We can wait 3 years and wait for MSVC 14 to be our baseline, and then the problem of thread-safe static initialization will go away.

If I understand correctly this can still be safe as long as the flag is defined as a static global right? It will keep a static initializer around on Windows, but it should be safe.

-Chris

I don't think it's that safe.

First, if someone accesses a ManagedStatic before main (this happens today,
right?), then we have no guarantee that the flag has been initialized
because initialization order of globals is arbitrary. The global will start
out zero, but then it will be re-zeroed out later and we'll get double
initialization. =(

Second, it's hard to ensure that all callers actually use global statics
instead of static locals.

BTW, MSVC 14 *also* uses "= default" for std::atomic's default constructor,
which solves the problem with the std::atomic double checked locking
approach.

Also, I thought the motivation of this work was to eliminate dynamic
initialization for WebKit. Does WebKit no longer care about dynamic
initialization on Windows?

Ugh... So, std::atomic is off the table.

I think we can do this instead:

void llvm::call_once(once_flag &flag, void (*fptr)(void)) {
  while (flag != Done) {
    if (flag == Wait) {
      ::Sleep(1);
      continue;
    }

    sys::cas_flag old_val = sys::CompareAndSwap(&flag, Wait, Uninitialized);
    if (old_val == Uninitialized) {
      fptr();
      sys::MemoryFence();
      flag = Done;
    }
  }
  sys::MemoryFence();
}

It ends up being a bit conservative on MemFences on x86, but it should be safe, and since once_flag is just an unsigned, it has a trivial static initializer and should avoid the issue that std::atomic has.
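The point about trivial initialization can be checked with type traits. In the sketch below, plain_flag is a hypothetical stand-in for an integer once_flag, and FlagWithCtor is a hypothetical type with a user-provided constructor, loosely modeling the situation described for std::atomic in the MSVC STL of that era. A static of the first kind is zero-initialized before any code runs; a function-local static of the second kind may need a guard variable.

```cpp
#include <cassert>
#include <type_traits>

namespace demo_init {

typedef unsigned plain_flag; // hypothetical stand-in for an integer once_flag

struct FlagWithCtor {
  FlagWithCtor() : Value(0) {} // user-provided, non-constexpr constructor
  unsigned Value;
};

static_assert(std::is_trivially_default_constructible<plain_flag>::value,
              "an integer flag needs no constructor call");
static_assert(!std::is_trivially_default_constructible<FlagWithCtor>::value,
              "a user-provided constructor makes the type non-trivial");

// The static below is zero-initialized at load time, so reading it never
// depends on a guard variable.
inline unsigned read_zero_initialized_flag() {
  static plain_flag Flag;
  return Flag;
}

} // namespace demo_init
```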

Can you instead use a windows specific api for implementing call once?

Can you instead use a windows specific api for implementing call once?

The obvious API was added in Vista, so we'd have to drop support for
running LLVM on Windows XP. Given that official Microsoft support ended in
April, it's probably reasonable for us to do this.

XP will probably fall off the cart soon for other reasons. For example, if
we want new C++11 library features (std::thread!), we will need to use
newer CRTs which probably do not run on XP.

Pre-Vista MSDN basically recommends that you roll your own double checked
locking to avoid races due to static local initialization. =P
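For reference, a sketch of what a wrapper over the Vista one-time initialization API might look like. This is not what was landed in this revision, and the Windows branch is untested here; the demo_win namespace, trampoline, and run_twice are hypothetical names. On other platforms the sketch simply falls back to std::call_once so that it stays buildable.

```cpp
#include <cassert>

#ifdef _WIN32
#ifndef _WIN32_WINNT
#define _WIN32_WINNT 0x0600 // InitOnceExecuteOnce requires Vista or later
#endif
#include <windows.h>

namespace demo_win {
typedef INIT_ONCE once_flag;

// Adapts the Win32 callback signature to a plain void() function.
// Param points at the caller's function pointer.
static BOOL CALLBACK trampoline(PINIT_ONCE, PVOID Param, PVOID *) {
  (*static_cast<void (**)()>(Param))();
  return TRUE;
}

inline void call_once(once_flag &Flag, void (*Fn)()) {
  // The call is synchronous, so passing the address of Fn is safe.
  InitOnceExecuteOnce(&Flag, trampoline, &Fn, nullptr);
}
} // namespace demo_win
#else
#include <mutex>

namespace demo_win {
typedef std::once_flag once_flag;

inline void call_once(once_flag &Flag, void (*Fn)()) {
  std::call_once(Flag, Fn);
}
} // namespace demo_win
#endif

namespace demo_win {
static int Counter = 0;
static void bump() { ++Counter; }

// Calls bump through the same flag twice and reports how often it ran.
inline int run_twice() {
#ifdef _WIN32
  static once_flag Flag = INIT_ONCE_STATIC_INIT;
#else
  static once_flag Flag;
#endif
  call_once(Flag, bump);
  call_once(Flag, bump);
  return Counter;
}
} // namespace demo_win
```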

I recognize that Chandler is probably right that we should use the Windows API (with the caveat that we're OK with dropping support for running on XP, as Reid pointed out).

Would it be reasonable for me to land the hand-rolled double checked solution instead, and have someone else (preferably someone more skilled with Windows) implement the Windows API based solution later?

On Thu, Oct 30, 2014 at 1:29 PM, Chris Bieneman <beanz@apple.com> wrote:

I recognize that Chandler is probably right that we should use the
Windows API (with the caveat that we're OK with dropping support for
running on XP, as Reid pointed out).

Would it be reasonable for me to land the hand-rolled double checked
solution instead, and have someone else (preferably someone more skilled
with Windows) implement the Windows API based solution later?

Let's do that. It's a strict improvement over the single check static local
initializer we have currently.

In a week or so after the XP RFC has baked we can go back and do this
better.
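For comparison, this is roughly what double-checked locking looks like when a conforming C++11 <atomic> and <mutex> are available. That is an assumption which, per the discussion above, did not hold for function-local statics on the MSVC baseline of the time. The names demo_dcl, Widget, Instance, and get are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

namespace demo_dcl {

struct Widget {
  int Value;
};

// Both globals have constexpr constructors, so neither needs dynamic init.
static std::atomic<Widget *> Instance(nullptr);
static std::mutex InitLock;
static int Constructions = 0;

inline Widget *get() {
  // First check: lock-free fast path once the object exists.
  Widget *W = Instance.load(std::memory_order_acquire);
  if (!W) {
    std::lock_guard<std::mutex> Guard(InitLock);
    // Second check: another thread may have won while we waited.
    W = Instance.load(std::memory_order_relaxed);
    if (!W) {
      W = new Widget{7}; // intentionally never freed in this sketch
      ++Constructions;
      Instance.store(W, std::memory_order_release);
    }
  }
  return W;
}

} // namespace demo_dcl
```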

Cool. I'm testing new patches now and will post them as soon as I have them ready.

Went back to the hand-rolled non-std::atomic version due to issues Reid pointed out.

lgtm

Thanks, sorry for the runaround.

This revision is now accepted and ready to land. Oct 30 2014, 2:54 PM

Closed by commit rL220932 (authored by cbieneman).

I’ve landed the change in r220932. I will watch the bots to make sure everything goes smoothly.

Thanks everyone for all the help on this.
-Chris

This broke NetBSD:

http://lab.llvm.org:8011/builders/lldb-amd64-ninja-netbsd7/builds/4137

@beanz can you have a look?

I have reproduced it locally:

[ 28%] Built target llvm-tblgen
[ 28%] Built target not
Scanning dependencies of target intrinsics_gen
Scanning dependencies of target AttributeCompatFuncTableGen
Scanning dependencies of target LibOptionsTableGen
[ 28%] Building Intrinsics.gen...
[1]   Segmentation fault (core dumped) ../../../bin/llv...
--- include/llvm/IR/Intrinsics.gen.tmp ---
*** [include/llvm/IR/Intrinsics.gen.tmp] Error code 139

make[2]: stopped in /tmp/pkgsrc-tmp/wip/llvm-git/work/build

Ouch, it was a revert of a revert, so I'm not sure which Phabricator revision currently tracks it.

Diff 15285

include/llvm/Support/Threading.h

[9 unchanged lines hidden]
 // This file declares helper functions for running LLVM in a multi-threaded
 // environment.
 //
 //===----------------------------------------------------------------------===//

 #ifndef LLVM_SUPPORT_THREADING_H
 #define LLVM_SUPPORT_THREADING_H

+#include "llvm/Config/llvm-config.h" // for LLVM_ON_UNIX
+
+#if defined(LLVM_ON_UNIX)
+#include <mutex>
+#else
+#include "llvm/Support/Atomic.h"
+#include "llvm/Support/Valgrind.h"
+#endif
+
 namespace llvm {
 /// Returns true if LLVM is compiled with support for multi-threading, and
 /// false otherwise.
 bool llvm_is_multithreaded();

 /// llvm_execute_on_thread - Execute the given \p UserFn on a separate
 /// thread, passing it the provided \p UserData.
 ///
 /// This function does not guarantee that the code will actually be executed
 /// on a separate thread or honoring the requested stack size, but tries to do
 /// so where system support is available.
 ///
 /// \param UserFn - The callback to execute.
 /// \param UserData - An argument to pass to the callback function.
 /// \param RequestedStackSize - If non-zero, a requested size (in bytes) for
 /// the thread stack.
 void llvm_execute_on_thread(void (*UserFn)(void*), void *UserData,
                             unsigned RequestedStackSize = 0);

+/// \brief Execute the function specified as a template parameter once.
+///
+/// Calls \p UserFn once ever. The call uniqueness is based on the address of
+/// the function passed in via the template arguement. This means no matter how
+/// many times you call llvm_call_once<foo>() in the same or different
+/// locations, foo will only be called once.
+///
+/// Typical usage:
+/// \code
+/// void foo() {...};
+/// ...
+/// llvm_call_once<foo>();
+/// \endcode
+///
+/// \tparam UserFn Function to call once.
+template <void (*UserFn)(void)> void llvm_call_once() {
+#if defined(LLVM_ON_UNIX)
+  static std::once_flag flag;
+  std::call_once(flag, UserFn);
+#else
+  // Due to a bug in Windows 7's C++ runtime std::call_once cannot be called
+  // during a static initializer. To workaround this the hand rolled solution
+  // below uses thread primitives to implement call_once semantics.
+  // Visual Studio bug id #811192.
+  static volatile sys::cas_flag initialized = 0;
    majnemer (inline): `volatile` doesn't really make sense here.
    beanz (inline): Why not? Having this be volatile actually alleviates the possibility of the compiler optimizing away loads.
+  sys::cas_flag old_val = sys::CompareAndSwap(&initialized, 1, 0);
    majnemer (inline): This is very expensive, you are performing a RMW operation even if nothing has changed.
    beanz (inline): Sure. For performance this should only execute if not already initialized.
+  if (old_val == 0) {
+    UserFn();
+    sys::MemoryFence();
+    initialized = 2;
+  }
+  else {
+    sys::cas_flag tmp = initialized;
+    sys::MemoryFence();
+    while (tmp != 2) {
+      tmp = initialized;
+      sys::MemoryFence();
    majnemer (inline): Let's assume that loading `initialized` will somehow turn into a relaxed load. Why bother with the fence? You could just fence after: while (initialized != 2); sys::MemoryFence(); This will result in far, far fewer fences should this ever be contended.
    beanz (inline): The point of making initialized volatile is so that the compiler will emit a load for the while loop. If you're loading with each iteration of the loop is the fence really needed at all? It shouldn't be for correctness.
+    }
+  }
+#endif
+}
 }

 #endif

lib/Support/ManagedStatic.cpp

[10 unchanged lines hidden]
 //
 //===----------------------------------------------------------------------===//

 #include "llvm/Support/ManagedStatic.h"
 #include "llvm/Config/config.h"
 #include "llvm/Support/Atomic.h"
 #include "llvm/Support/Mutex.h"
 #include "llvm/Support/MutexGuard.h"
+#include "llvm/Support/Threading.h"
 #include <cassert>
 using namespace llvm;

 static const ManagedStaticBase *StaticList = nullptr;
+static sys::Mutex *ManagedStaticMutex = nullptr;

-static sys::Mutex& getManagedStaticMutex() {
+static void initializeMutex() {
+  ManagedStaticMutex = new sys::Mutex();
+}
+
+static sys::Mutex* getManagedStaticMutex() {
   // We need to use a function local static here, since this can get called
   // during a static constructor and we need to guarantee that it's initialized
   // correctly.
-  static sys::Mutex ManagedStaticMutex;
+  llvm_call_once<initializeMutex>();
   return ManagedStaticMutex;
 }

 void ManagedStaticBase::RegisterManagedStatic(void *(*Creator)(),
                                               void (*Deleter)(void*)) const {
   assert(Creator);
   if (llvm_is_multithreaded()) {
-    MutexGuard Lock(getManagedStaticMutex());
+    MutexGuard Lock(*getManagedStaticMutex());

     if (!Ptr) {
       void* tmp = Creator();

       TsanHappensBefore(this);
       sys::MemoryFence();

       // This write is racy against the first read in the ManagedStatic
[33 unchanged lines hidden, inside void ManagedStaticBase::destroy() const {]

   // Cleanup.
   Ptr = nullptr;
   DeleterFn = nullptr;
 }

 /// llvm_shutdown - Deallocate and destroy all ManagedStatic variables.
 void llvm::llvm_shutdown() {
-  MutexGuard Lock(getManagedStaticMutex());
+  MutexGuard Lock(*getManagedStaticMutex());

   while (StaticList)
     StaticList->destroy();
 }

This is an archive of the discontinued LLVM Phabricator instance.

Removing the static initializer in ManagedStatic.cpp by using llvm_call_once to initialize the ManagedStatic mutex.
ClosedPublic
