This is an archive of the discontinued LLVM Phabricator instance.

Removing the static initializer in ManagedStatic.cpp by using llvm_call_once to initialize the ManagedStatic mutex.
ClosedPublic

Authored by beanz on Oct 22 2014, 4:59 PM.

Details

Summary

This patch adds an llvm_call_once which is a wrapper around std::call_once on platforms where it is available and devoid of bugs. The patch also migrates the ManagedStatic mutex to be allocated using llvm_call_once.

These changes are philosophically equivalent to the changes added in r219638, which were reverted due to a hang on Win32 which was the result of a bug in the Windows implementation of std::call_once.

Diff Detail

Event Timeline

beanz updated this revision to Diff 15285.Oct 22 2014, 4:59 PM
beanz retitled this revision from to Removing the static initializer in ManagedStatic.cpp by using llvm_call_once to initialize the ManagedStatic mutex..
beanz updated this object.
beanz edited the test plan for this revision. (Show Details)
beanz added reviewers: chandlerc, rnk, aaron.ballman, chapuni.
beanz added a subscriber: Unknown Object (MLST).
beanz updated this revision to Diff 15332.Oct 23 2014, 9:34 AM

Fixed code comments based on Aaron's feedback.

Aaron, I understand your point about using the Windows one-time initialization, however I'm really not the best candidate to write Windows specific code. I am much more familiar with basic threading constructs like MemFence and CompareAndSwap than Win32 APIs. Could you provide and test a Windows one-time init solution?

Thanks,
-Chris

rnk edited edge metadata.Oct 23 2014, 10:10 AM
In D5922#4, @beanz wrote:

Fixed code comments based on Aaron's feedback.

Aaron, I understand your point about using the Windows one-time initialization, however I'm really not the best candidate to write Windows specific code. I am much more familiar with basic threading constructs like MemFence and CompareAndSwap than Win32 APIs. Could you provide and test a Windows one-time init solution?

The one time initialization APIs were added in Vista, and I don't know if we've dropped support for XP or not yet. Certainly Microsoft has. =/

I suspect that the fence and CAS implementation has subtle issues, but I'm not a C++ memory model expert. David Majnemer has a better understanding, and he said there are issues.

When I pointed him at this, he told me the template interface I recommended was a bad idea. Maybe having llvm::once_flag and llvm::call_once that alias the std:: variants when available is the way to go. WDYT?

I can re-work this to be an alias of call_once when available. I think in the absence of call_once the hand rolled solution should be sufficient. I'm curious what David thinks the issues with the hand rolled implementation are. I'm pretty good WRT lock-free multi-threading, and I don't spot anything. Also, since we're using pretty much that same code in pass initialization, if there are issues we should identify and solve them.

I would instead recommend we do something along the lines of:

#include <atomic>
#include <thread>

enum InitStatus {
  Done = -1,
  Uninitialized = 0,
  Wait = 1
};

void call_once(std::atomic<InitStatus> &Initialized, void (*fptr)(void)) {
  InitStatus Status = Initialized.load(std::memory_order_relaxed);
  for (;;) {
    // Figure out which state we are in.
    if (Status == Done) {
      // We are at Done, no further transitions are possible.
      // This acquire fence pairs with the release fense below.
      std::atomic_thread_fence(std::memory_order_acquire);
      return;
    }
    if (Status == Wait) {
      // We are waiting for a Wait -> Done transition.
      std::this_thread::yield();
      // Reload status.
      Status = Initialized.load(std::memory_order_relaxed);
      continue;
    }
    // Attempt an Uninitialized -> Wait transition.
    // If we succeed, perform initialization.
    // If we do not, Status will be updated with our new state.
    Status = Uninitialized;
    bool ShouldInit = std::atomic_compare_exchange_weak_explicit(
        &Initialized, &Status, Wait, std::memory_order_relaxed,
        std::memory_order_relaxed);
    if (ShouldInit) {
      (*fptr)();
      std::atomic_thread_fence(std::memory_order_release);
      Initialized.store(Done, std::memory_order_relaxed);
    }
  }
}

This approach ends up with no fences in the actual assembly with either GCC or clang when targeting X86-64. It has the additional benefit of not dirtying any cache lines if initialization has occurred long before.

include/llvm/Support/Threading.h
70

volatile doesn't really make sense here.

71

This is very expensive, you are performing a RMW operation even if nothing has changed.

77–82

Let's assume that loading initialized will somehow turn into a relaxed load. Why bother with the fence? You could just fence after:

while (initialized != 2);
sys::MemoryFence();

This will result in far, far fewer fences should this ever be contended.

beanz added a comment.Oct 23 2014, 5:09 PM

It sounds like David's complaints are performance related not correctness. Unfortunately his suggested implementation doesn't really cover the needs for this. The hand-rolled solution is basically filling in on Windows where we aren't guaranteed to have <mutex>, <thread>, or <atomic>.

There are definitely more performant ways to implement the hand rolled solution, but it should be safe as implemented. That said, I will make adjustments to the implementation and submit updated patches for review.

include/llvm/Support/Threading.h
70

Why not? Having this be volatile actually alleviates the possibility of the compiler optimizing away loads.

71

Sure. For performance this should only execute if not already initialized.

77–82

The point of making initialized volatile is so that the compiler will emit a load for the while loop. If you're loading with each iteration of the loop is the fence really needed at all?

It shouldn't be for correctness.

In D5922#10, @beanz wrote:

It sounds like David's complaints are performance related not correctness.

I'm afraid not. Because your implementation uses operations which are not atomic, it has undefined behavior in the eyes of C++11.

C++11 [intro.multithread]p21:
The execution of a program contains a data race if it contains two conflicting actions in different threads, at least one of which is not atomic, and neither happens before the other. Any such data race results in undefined behavior.

Unfortunately his suggested implementation doesn't really cover the needs for this. The hand-rolled solution is basically filling in on Windows where we aren't guaranteed to have <mutex>, <thread>, or <atomic>.

Is there a platform where we can't use <atomic>? I'm aware of platforms where we cannot use <thread> or <mutex> but we should use <atomic> if we can. My call_once implementation only used <thread> for yield; if we removed it, we wouldn't need anything other than <atomic>.

beanz added a comment.Oct 27 2014, 1:51 PM

A few notes.

According to Chandler's related RFC (http://lists.cs.uiuc.edu/pipermail/llvmdev/2014-September/077081.html) we do not have a requirement that all platforms support <atomic>.

In D5922#11, @majnemer wrote:

I'm afraid not. Because your implementation uses operations which are not atomic, it has undefined behavior in the eyes of C++11.

C++11 [intro.multithread]p21:
The execution of a program contains a data race if it contains two conflicting actions in different threads, at least one of which is not atomic, and neither happens before the other. Any such data race results in undefined behavior.

I think you're misconstruing what this means. Your statement would seem to imply that until C++11 nobody ever managed to write thread safe code-- which is absurd. There are certainly undefined behaviors in the implementation I've provided, but they are known, and appropriately handled.

Both the MemFences, and the tmp in the else branch of the original patch can be safely removed, and the code can be re-written into a while loop for performance. An example implementation would be:

enum InitStatus {
  Done = -1,
  Uninitialized = 0,
  Wait = 1
};

void llvm::call_once(once_flag& flag, void (*UserFn)(void)) {
  while (flag != Done) {
    if (flag == Wait) {
      ::Sleep(1);
      continue;
    }

    sys::cas_flag old_val = sys::CompareAndSwap(&flag, Wait, Uninitialized);
    if (old_val == Uninitialized) {
      UserFn();
      sys::MemoryFence();
      flag = Done;
    }
  }
}

I (and some of my colleagues) believe that this is perfectly safe code, and it eliminates unnecessary MemFences (which you raised concerns about). Do you see a problem with this implementation (other than the fact that it doesn't use std::atomic)?

I don't have any particular opposition to using std::atomic, however since we don't have an enforced requirement I'm not sure we can assume we have it. Please keep in mind I think we still support deploying back to Windows XP.

beanz updated this revision to Diff 15529.Oct 28 2014, 3:14 PM
beanz edited edge metadata.

New patches based on feedback.

Updates in these patches:

  • Broke Windows and Unix code out into separate Windows.inc files.
  • Windows implementation uses <atomic> implementation provided by majnemer.

These patches are tested on Windows and Unix.

All the atomic stuff looks great to me.

rnk added a comment.Oct 30 2014, 11:08 AM

So, bad news, std::atomic<int> has a non-trivial default constructor in the MSVC STL. This means the code void foo() { static std::atomic<int> flag; } uses a non-threadsafe guard variable to initialize flag. Worse, the guard variable is a bitfield, so it is very much not threadsafe. See the code for foo here:

  mov         eax,dword ptr [?$S1@?1??foo@@YAXXZ@4IA]            
  test        al,1                                               
  jne         initialized
  or          eax,1                                              
  mov         ecx,offset ?flag@?1??foo@@YAXXZ@4U?$atomic@H@std@@A
  mov         dword ptr [?$S1@?1??foo@@YAXXZ@4IA],eax            
  xor         eax,eax                                            
  xchg        eax,dword ptr [ecx]                                
initialized:
  ret

So... I'm running out of ideas. :( We can wait 3 years and wait for MSVC 14 to be our baseline, and then the problem of thread-safe static initialization will go away.

If I understand correctly this can still be safe as long as the flag is defined as a static global right? It will keep a static initializer around on Windows, but it should be safe.

-Chris

rnk added a comment.Oct 30 2014, 11:17 AM

I don't think it's that safe.

First, if someone accesses a ManagedStatic before main (this happens today,
right?), then we have no guarantee that the flag has been initialized
because initialization order of globals is arbitrary. The global will start
out zero, but then it will be re-zeroed out later and we'll get double
initialization. =(

Second, it's hard to ensure that all callers actually use global statics
instead of static locals.

BTW, MSVC 14 *also* uses "= default" for std::atomic's default constructor,
which solves the problem with the std::atomic double checked locking
approach.

rnk added a comment.Oct 30 2014, 11:19 AM

Also, I thought the motivation of this work was to eliminate dynamic
initialization for WebKit. Does WebKit no longer care about dynamic
initialization on Windows?

Ugh... So, std::atomic is off the table.

I think we can do this instead:

void llvm::call_once(once_flag &Initialized, void (*fptr)(void)) {
  while (flag != Done) {
    if (flag == Wait) {
      ::Sleep(1);
      continue;
    }

    sys::cas_flag old_val = sys::CompareAndSwap(&flag, Wait, Uninitialized);
    if (old_val == Uninitialized) {
      fptr();
      sys::MemoryFence();
      flag = Done;
    }
  }
  sys::MemoryFence();
}

It ends up being a bit conservative on MemFences on x86, but it should be safe, and since once_flag is just an unsigned, it has a trivial static initializer and should avoid the issue that std::atomic has.

chandlerc edited edge metadata.Oct 30 2014, 12:42 PM

Can you instead use a windows specific api for implementing call once?

rnk added a comment.Oct 30 2014, 1:17 PM

Can you instead use a windows specific api for implementing call once?

The obvious API was added in Vista, so we'd have to drop support for
running LLVM on Windows XP. Given that official Microsoft support ended in
April, it's probably reasonable for us to do this.

XP will probably fall off the cart soon for other reasons. For example, if
we want new C++11 library features (std::thread!), we will need to use
newer CRTs which probably do not run on XP.

Pre-Vista MSDN basically recommends that you roll your own double checked
locking to avoid races due to static local initialization. =P

beanz added a comment.Oct 30 2014, 1:29 PM

While I recognize that Chandler's point is probably right that we should use the Windows API (caveat that we're ok with dropping running on XP as Reid pointed out).

Would it be reasonable for me to land the hand-rolled double checked solution instead, and have someone else (preferably someone more skilled with Windows) implement the Windows API based solution later?

rnk added a comment.Oct 30 2014, 1:32 PM

n Thu, Oct 30, 2014 at 1:29 PM, Chris Bieneman <beanz@apple.com> wrote:

While I recognize that Chandler's point is probably right that we should
use the Windows API (caveat that we're ok with dropping running on XP as
Reid pointed out).

Would it be reasonable for me to land the hand-rolled double checked
solution instead, and have someone else (preferably someone more skilled
with Windows) implement the Windows API based solution later?

Let's do that. It's a strict improvement over the single check static local
initializer we have currently.

In a week or so after the XP RFC has baked we can go back and do this
better.

beanz added a comment.Oct 30 2014, 1:35 PM

Cool. I'm testing new patches now and will post them as soon as I have them ready.

beanz updated this revision to Diff 15580.Oct 30 2014, 2:30 PM
beanz edited edge metadata.

Went back to the hand-rolled non-std::atomic version due to issues Reid pointed out.

rnk accepted this revision.Oct 30 2014, 2:54 PM
rnk edited edge metadata.

lgtm

Thanks, sorry for the runaround.

This revision is now accepted and ready to land.Oct 30 2014, 2:54 PM
Diffusion closed this revision.Oct 30 2014, 3:17 PM
Diffusion updated this revision to Diff 15586.

Closed by commit rL220932 (authored by cbieneman).

beanz added a comment.Oct 30 2014, 3:19 PM

I’ve landed the change in r220932. I will watch the bots to make sure everything goes smoothly.

Thanks everyone for all the help on this.
-Chris

This broke NetBSD:

http://lab.llvm.org:8011/builders/lldb-amd64-ninja-netbsd7/builds/4137

@beanz can you have a look?

I have reproduced it locally:

[ 28%] Built target llvm-tblgen
[ 28%] Built target not
Scanning dependencies of target intrinsics_gen
Scanning dependencies of target AttributeCompatFuncTableGen
Scanning dependencies of target LibOptionsTableGen
[ 28%] Building Intrinsics.gen...
[1]   Segmentation fault (core dumped) ../../../bin/llv...
--- include/llvm/IR/Intrinsics.gen.tmp ---
*** [include/llvm/IR/Intrinsics.gen.tmp] Error code 139

make[2]: stopped in /tmp/pkgsrc-tmp/wip/llvm-git/work/build

Ouch, it was revert of revert, so I'm not sure what is the current Phabricator issue for it.