This is an archive of the discontinued LLVM Phabricator instance.

[libc++][mingw] TLS cleanup on windows
AbandonedPublic

Authored by kalle-llvm on Jul 14 2023, 2:30 AM.

Details

Reviewers
mstorsjo
Group Reviewers
Restricted Project
Restricted Project
Summary

On windows, when libc++ is statically linked into a dll the tls-key used for __cxa_eh_globals must be freed when the dll is unloaded. Otherwise the callback registered for the tls-key will remain in place but point to non-existing code and the process will crash as soon as a thread exits.

This will happen if libc++ is statically linked into the dll or if the main application doesn't use libc++ (so that is is unloaded when the dll is unloaded).

I've made a function __libcpp_tls_destroy() that is registered with atexit() that frees the tls-key. I did this conditionally on _LIBCPP_HAS_THREAD_API_WIN32 since I'm not sure how this should be handled for the other thread models.

I think a similar problem exists with the tls-key in __thread_specific_ptr, but a comment in the destructor (thread.h line 103) says that the missing cleanup is intentional in the pthreads case. And this destructor is inlined so I guess adding a cleanup there would be an ABI break?

Diff Detail

Event Timeline

kalle-llvm created this revision.Jul 14 2023, 2:30 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 14 2023, 2:30 AM
kalle-llvm requested review of this revision.Jul 14 2023, 2:30 AM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptJul 14 2023, 2:30 AM
Herald added a reviewer: Restricted Project. · View Herald Transcript
Herald added a reviewer: Restricted Project. · View Herald Transcript
jrtc27 added a subscriber: jrtc27.Jul 14 2023, 3:35 AM

Isn't atexit for process termination? I'd be surprised if it really does run during FreeLibrary. The normal (/documented) approach is to handle DLL_PROCESS_DETACH inside DllMain, but many applications also seem to use a C++ destructor, which will compose better with static linking.

I'll also note libcxxabi/src/cxa_thread_atexit.cpp has some documented hairiness around libcpp_tls_create, and that both it and thread_specific_ptr are users of libcpp_tls_create that presumably have similar issues to the one discussed here for cxa_get_globals.

Isn't atexit for process termination? I'd be surprised if it really does run during FreeLibrary. The normal (/documented) approach is to handle DLL_PROCESS_DETACH inside DllMain, but many applications also seem to use a C++ destructor, which will compose better with static linking.

On Windows atexit functions run on dll unload, and global destructors are implemented with atexit. I considered using a global destructor but I wanted to make as small a change as possible.

And it does work, at least for my use case with static linking of libc++/libc++abi/libunwind.

Isn't atexit for process termination? I'd be surprised if it really does run during FreeLibrary. The normal (/documented) approach is to handle DLL_PROCESS_DETACH inside DllMain, but many applications also seem to use a C++ destructor, which will compose better with static linking.

As @kalle-llvm already mentioned, atexit on Windows actually is hooked up to the DllMain handler for DLL_PROCESS_DETACH, and this is the underlying mechanism used for C++ destructors already. Handlers set up with atexit are executed either when the process exits or the current DLL gets unloaded, whichever happens first.

I'll also note libcxxabi/src/cxa_thread_atexit.cpp has some documented hairiness around __libcpp_tls_create, and that both it and __thread_specific_ptr are users of __libcpp_tls_create that presumably have similar issues to the one discussed here for __cxa_get_globals.

Thanks for the pointers. For this specific use (which is specific to mingw usecases; it's windows-only, and libcxxabi isn't used in MSVC-like environments, only in itanium environments) I think it might be ok to rely on details about the actual orderings on windows/mingw.

@kalle-llvm Am I right in assuming, that the main ordering problem to be concerned about here, is when this atexit-registered function is called when a process is exiting (as opposed to when the enclosing DLL gets unloaded), whether the destroy_ function, calling FlsFree, gets called after the tls slot destructor itself already has fired (as I presume it does on shutdown)? I guess relevant cases to test is both when libc++ lives in its own DLL and when it is statically linked into the main executable (they have slightly different destructor execution orderings). If we can see that duplicate execution is safe here, or that we're sure that the atexit callback gets called before the tls slot destructors for the main thread, then this should probably be fine.

libcxx/include/__threading_support
244

Ok, so this is a new ABI entry in libcxx, but which only would be provided on certain platforms.

On one hand, this doesn't affect other platforms and their ABIs, and this is private API (used by libcxxabi) so it should probably be fine. But should we wrap it in a suitable ifdef to make it abundantly clear that it only is available in some cases?

I guess relevant cases to test is both when libc++ lives in its own DLL and when it is statically linked into the main executable (they have slightly different destructor execution orderings). If we can see that duplicate execution is safe here, or that we're sure that the atexit callback gets called before the tls slot destructors for the main thread, then this should probably be fine.

@mstorsjo You're right, there are several configurations that should be tested. I'm currently on vacation and will be away for another two weeks. When I'm back I will make the tests and report back.

I've also been thinking about what happens if a global destructor uses exception handling internally after the tls slot has been freed (if it comes later in the atexit list). As the patch stands now, it will pick up a stale tls slot which is obviously not good, so the destroy function should really zero out the tls key after it frees it. Then maybe __cxa_get_globals will trigger a new tls slot allocation (with a new atexit destructor) and things will just work. Or not. I will try to investigate this in more detail.

I guess relevant cases to test is both when libc++ lives in its own DLL and when it is statically linked into the main executable (they have slightly different destructor execution orderings). If we can see that duplicate execution is safe here, or that we're sure that the atexit callback gets called before the tls slot destructors for the main thread, then this should probably be fine.

@mstorsjo You're right, there are several configurations that should be tested. I'm currently on vacation and will be away for another two weeks. When I'm back I will make the tests and report back.

I've also been thinking about what happens if a global destructor uses exception handling internally after the tls slot has been freed (if it comes later in the atexit list). As the patch stands now, it will pick up a stale tls slot which is obviously not good, so the destroy function should really zero out the tls key after it frees it. Then maybe __cxa_get_globals will trigger a new tls slot allocation (with a new atexit destructor) and things will just work. Or not. I will try to investigate this in more detail.

Awesome, sounds good if you can take the time to investigate the circumstances properly - no rush! Thanks!

Unfortunately I have not been able to come up with a race-free way to clean up the tls key and still allow exception handling to keep working for code that runs after the cleanup (like destructors of global variables which can run concurrently).

However, in cxa_exception_storage.cpp there is another implementation of __cxa_get_globals() that is ifdef-ed by HAS_THREAD_LOCAL. This doesn't seem to be documented anywhere, but searching for it I found this which is the exact same bug. And turning on HAS_THREAD_LOCAL fixes it for me too.

@mstorsjo So thread_local seems to work on mingw, are there any gotchas? If not, perhaps HAS_THREAD_LOCAL should be turned on (at least on mingw, but it seems to me this bug should be possible to trigger on other platforms).

(This still leaves the issue with __thread_specific_pointer, but at least that can be worked around by not using std::thread.)

However, in cxa_exception_storage.cpp there is another implementation of __cxa_get_globals() that is ifdef-ed by HAS_THREAD_LOCAL. This doesn't seem to be documented anywhere, but searching for it I found this which is the exact same bug. And turning on HAS_THREAD_LOCAL fixes it for me too.

Weird. It seems like this codepath has existed in tree since it was added in 1df50ca6a2f60b3b60ca6adc1ab98f33184c3597 in 2011, without any way of it becoming enabled other than manually defining this macro. There seems to have been a recent attempt to enable it in November 2022, in 8271aa5335668a1dc62168a4e90b4554bd3a0ca7, which struck issues and was reverted a couple days later in bc91104e85b7a233a8966998f387b2222f350f6e. See https://reviews.llvm.org/D138461 for the full story.

@mstorsjo So thread_local seems to work on mingw, are there any gotchas? If not, perhaps HAS_THREAD_LOCAL should be turned on (at least on mingw, but it seems to me this bug should be possible to trigger on other platforms).

Hmm, I've paged out this matter from my mind again, so I'm not sure I have all the concerns in mind right now...

The thread local destructors do run at roughly the right time when they're supposed to, both when threads exit and when a DLL is unloaded. I guess there's a potential concern that there's no ordering between this thread local destructor and other thread local destructors. So I guess that'd mean that other thread local destructors can't use exception handling within their call stack, if they run after this particular dtor, as there's nothing to force their internal ordering - or is there?

The thread local destructors do run at roughly the right time when they're supposed to, both when threads exit and when a DLL is unloaded. I guess there's a potential concern that there's no ordering between this thread local destructor and other thread local destructors. So I guess that'd mean that other thread local destructors can't use exception handling within their call stack, if they run after this particular dtor, as there's nothing to force their internal ordering - or is there?

That's exactly what I thought at first, but then I realized that eh_globals is just a plain struct with no destructor. And it is not allocated with new it just lives in the thread_local memory block, so there should be no lifetime issues.

kalle-llvm abandoned this revision.Aug 13 2023, 8:52 AM

I guess I will mark this as abandoned since there is a better solution with thread local storage.