__cxa_thread_atexit_impl() isn't present on all platforms, for example
Android pre-6.0. This patch uses a weak symbol to detect _impl()
support, falling back to a pthread_key_t-based implementation.
Details
- Reviewers
mclow.lists danalbert jroelofs EricWF
Diff Detail
Event Timeline
src/cxa_thread_atexit.cpp | ||
---|---|---|
18–133 | Just simplicity. (There are some fringe benefits, like the ability to build against pre-2.18 glibc but have it detect ..._impl() support when run with newer libraries. Or magically supporting musl etc. if they add it, without having to recompile.) Would you rather keep the build-time checks for non-Android platforms? | |
36–47 | Nothing good! I'll add the necessary error handling. |
You should look at __thread_specific_ptr in libcxx's <thread>. It does a lot of these things in order to satisfy the requirements of notify_all_at_thread_exit, set_value_at_thread_exit, and make_ready_at_thread_exit.
@rmaprath has been doing some work to make the threading runtime library swappable. I don't recall if his work extended to libcxxabi or not, but I'll page him anyway.
This implementation of __cxa_thread_atexit doesn't interact nicely with shared libraries. The following sequence of events causes unloaded code to get invoked.
- Create thread 42
- Load library foo from thread 42
- Call a function with a thread_local object with a dtor.
- Unload library foo.
- Join thread 42
glibc does some extra work during __cxa_thread_atexit_impl to bump the reference count of the shared library so that the user's "unload" doesn't actually unload the library.
src/cxa_thread_atexit.cpp | ||
---|---|---|
40 | I think this is correct, but it needs some comments because it is not obvious what (or why) this is implemented this way. More specifically, document the cases where run_dtors is run because of ~DtorListHolder vs. the cases where run_dtors is run because of the callback registered at pthread_key_create. |
Had a look at it. One thing that stands out is that notify_all_at_thread_exit() and friends are supposed to be invoked *after* thread_local destructors. But the order that pthread_key destructors run in is unspecified. This could be worked around by waiting for the second iteration through pthread_key destructors before triggering ~__thread_struct_imp(). It looks like libstdc++ has a similar bug if ..._impl() isn't available.
@rmaprath has been doing some work to make the threading runtime library swappable. I don't recall if his work extended to libcxxabi or not, but I'll page him anyway.
<__threading_support>? Seems to be libc++-specific. There's a few other raw uses of pthreads in libc++abi.
This implementation of __cxa_thread_atexit doesn't interact nicely with shared libraries. The following sequence of events causes unloaded code to get invoked.
- Create thread 42
- Load library foo from thread 42
- Call a function with a thread_local object with a dtor.
- Unload library foo.
- Join thread 42
glibc does some extra work during __cxa_thread_atexit_impl to bump the reference count of the shared library so that the user's "unload" doesn't actually unload the library.
Yep. This is about as good as libc++abi can do on its own though. Note that libstdc++ has similar limitations if ..._impl() isn't available.
It also intentionally leaks the pthread key. Does the __thread_specific_ptr rationale hold for this change as well?
This implementation of __cxa_thread_atexit doesn't interact nicely with shared libraries. The following sequence of events causes unloaded code to get invoked.
- Create thread 42
- Load library foo from thread 42
- Call a function with a thread_local object with a dtor.
- Unload library foo.
- Join thread 42
glibc does some extra work during __cxa_thread_atexit_impl to bump the reference count of the shared library so that the user's "unload" doesn't actually unload the library.
Yep. This is about as good as libc++abi can do on its own though. Note that libstdc++ has similar limitations if ..._impl() isn't available.
I was going to tell you that this is implementable with dladdr (which I think Android has). Then I looked more at the "prevent unloading" side of things, and it looks like that requires digging into the library structures directly. Ugh.
Comment on the limitation in the source, but you don't need to change any code for this item.
Plan to get started with libc++abi as soon as I'm done with libc++, I've hit a couple of bumps with the latter, resolving them at the moment.
I don't see any new pthread dependencies introduced in this patch, so it sounds OK to me. Thanks for the ping though!
The patch itself looks OK to me. Will need approval from @mclow.lists or @EricWF.
src/cxa_thread_atexit.cpp | ||
---|---|---|
18–133 | I think that this should be an opt-in mechanism, there are platforms that presumably never need to pay the cost of the unused code (macOS comes to mind). |
Hmm, maybe? If other global destructors run after ~DtorListHolder(), and they cause a thread_local to be initialized for the first time, __cxa_thread_atexit() might be called again. I was thinking that dtors would get re-initialized in that case but it appears it does not. So yeah, I think I'll need to leak the pthread_key_t.
I'm not sure how to avoid leaking the actual thread_local objects that get created in that situation. There's nothing left to trigger run_dtors() a second time.
src/cxa_thread_atexit.cpp | ||
---|---|---|
18–133 | This file is only built for UNIX AND NOT (APPLE OR CYGWIN). Other platforms use something other than __cxa_thread_atexit() I assume. |
Hmm, maybe? If other global destructors run after ~DtorListHolder(), and they cause a thread_local to be initialized for the first time, __cxa_thread_atexit() might be called again. I was thinking that dtors would get re-initialized in that case but it appears it does not. So yeah, I think I'll need to leak the pthread_key_t.
I'm not sure how to avoid leaking the actual thread_local objects that get created in that situation. There's nothing left to trigger run_dtors() a second time.
I'm not concerned about the loss of memory or pthread_key resources in this leak, as it is a very short-lived leak (the process is going away after all). We do need to have an idea of what happens with the destructor invocations for the other kinds of resources though.
I think the C++14 spec says what should happen.
3.6.3 Termination
- [...] The completions
of the destructors for all initialized objects with thread storage duration within that thread are sequenced
before the initiation of the destructors of any object with static storage duration. If the completion of the
constructor or dynamic initialization of an object with thread storage duration is sequenced before that of
another, the completion of the destructor of the second is sequenced before the initiation of the destructor
of the first. If the completion of the constructor or dynamic initialization of an object with static storage
duration is sequenced before that of another, the completion of the destructor of the second is sequenced
before the initiation of the destructor of the first.
What that means for this implementation is that I think that _cxa_thread_atexit is allowed to be called during run_dtors. If running the dtor for a thread local variable 'cat', we encounter a previously unseen thread_local 'dog', the compiler will call the ctor, then register the dtor with _cxa_thread_atexit. Since it is the most recently constructed thread local object, I would expect the 'dog' dtor to be the next dtor to be run. You may be able to support this just by moving "elem = elem->next" below the dtor invocation.
On the topic of __cxa_thread_atexit, was it ever specified how it interacts with things like thread cancellation?
src/cxa_thread_atexit.cpp | ||
---|---|---|
45 | run_dtors() is called when/if libc++.so gets unloaded... but only for the thread calling dlclose()? |
I don't think it's officially specified anywhere. C++ threads don't have a cancel method. The POSIX spec doesn't speak about the C++ ABI. The Itanium ABI could talk about this, but hasn't yet.
I think this implementation does the right thing with regards to cancellation though. POSIX says that first cancellation cleanup handlers are called, then thread-specific data destructors are called. pthread_cancel is still a really bad idea due to how it (doesn't) interact with RAII, but at least TLS data won't get leaked.
src/cxa_thread_atexit.cpp | ||
---|---|---|
45 | Most of the dtor magic is on the pthread_key_create side. pthreads lets you register a per-thread destructor. This destructor is only run on process termination (I think). |
src/cxa_thread_atexit.cpp | ||
---|---|---|
45 | I meant the call from ~DtorListHolder() |
Fixed some corner cases regarding destruction order and very-late-initialized thread_locals. Explicitly documented the known limitations compared to __cxa_thread_atexit_impl().
It wasn't quite that easy (have to re-look at the pthread_key to get newly added thread_locals), but that's done in the latest patch.
src/cxa_thread_atexit.cpp | ||
---|---|---|
38 | See http://stackoverflow.com/q/38130185/502399 for a test case that would trigger this. This may not be necessary depending on the answer to that question. | |
45 | This has changed somewhat in the latest patch, but the gist is similar. If libc++abi.so is dlclose()d, there had better not be any still-running threads that expect to execute thread_local destructors (or any other C++ code, for that matter). In the usual case (libc++abi.so loaded at startup, not by a later dlopen()), the last run_dtors() call happens as the final thread is exiting. |
src/cxa_thread_atexit.cpp | ||
---|---|---|
47 | Why are we doing this? I can see it being a little useful when debugging / developing, so that you get an early warning that something has gone wrong, but it seems like this will always be setting a value to the value it already has. | |
54 | Maybe this concern is unfounded, but I'm not overly fond of pthread_getspecific and setspecific in a loop. I've always been under the impression that those functions are rather slow. Could we add a layer of indirection so that we don't need to call getspecific and setspecific so often? Basically make the pointer that is directly stored in TLS an immutable pointer to pointer. |
Also, can you add test cases for a lot of these things? I don't expect test cases for the DSO side of things, but a lot of the tricky atexit cases should be covered.
Added a test case for destructor ordering. Got rid of pthread_{get,set}specific in a loop.
src/cxa_thread_atexit.cpp | ||
---|---|---|
23 | I'm going to have to agree with @majnemer. I think that the config check for LIBCXXABI_HAS_CXA_THREAD_ATEXIT_IMPL should stay in place. If cxa_thread_atexit_impl exists, then all of the fallback code can disappear at preprocessing time. We do lose out on the minor benefit of avoiding some libc++ recompiles, but we also avoid code bloat. For what it's worth, I'm willing to keep the weak symbol check in place if __cxa_thread_atexit_impl isn't present, I just don't want to pay for the fallback when I know I'm not going to use it. | |
47 | The loop doesn't read pthread_getspecific anymore. I get the need for the setspecific call here for your previous design, but I don't think it's needed now. | |
test/thread_local_destruction_order.pass.cpp | ||
2 | Nit: file name is wrong here. | |
49 | Can we have a CreatesThreadLocalInDestructor in the thread_fn as well? That way we can test both the main function and a pthread. If I understand your code and comments correctly, those go through different code paths. | |
55 | In the places where you can, validate that dtors actually are getting called. This may be your only place where you can do that. So something like 'assert(seq == 1)' here. |
src/cxa_thread_atexit.cpp | ||
---|---|---|
23 | Makes sense, I'll do that. | |
47 | __cxa_thread_atexit() calls pthread_getspecific(), so it's still needed. Otherwise it would create a new list instead of adding to the current one, and the ordering would be wrong. | |
test/thread_local_destruction_order.pass.cpp | ||
49 | Yep, meant to do that actually! | |
55 | Sounds good. What I wanted to do was print some output in the destructors, and check for a certain expected output. But I couldn't figure out how to do that with lit. |
test/thread_local_destruction_order.pass.cpp | ||
---|---|---|
55 | Normally, you would do that by piping the output to the llvm FileCheck utility. My unconfirmed suspicion is that that approach will run into difficulties because of libcxx and libcxxabi specific setups. I think just having the global object's dtor check is good enough for the final post condition though. I'm not terribly worried about global dtors malfunctioning. |
- Bring back HAVE___CXA_THREAD_ATEXIT_IMPL, and avoid the weak symbol/fallback implementation in that case
- Fix a leak in an error path
- Add a CreatesThreadLocalInDestructor to a non-main thread in the destructor ordering test
LGTM (with a comment nit), but you'll need to get approval from @EricWF or @mclow.lists.
I would like some of the information from your stack overflow post to make it's way to the comments. In particular, I think I would like to see it documented that we have made a choice for some undefined behavior.
Update comments to mention that late-initialized thread_locals invoke undefined behavior.
Is there a way to do that through Phabricator? Or did you mean to email them directly? (Not sure what their emails are but I can probably figure it out from the list history.)
The "@" will do a ping through phabricator, but a direct email is probably going to be your best bet at this point.
We can perform far fewer calls to pthread_getspecific/pthread_setspecific if we represent the list head using a global __thread DtorList* list_head = nullptr.
This also allows us to avoid the hack of setting/unsetting the key during run_dtors() which I really do not like.
Here is a patch that applies such changes: https://gist.github.com/EricWF/a071376b1216aabdd1695eec2175c374
What do you think of this idea?
src/cxa_thread_atexit.cpp | ||
---|---|---|
42 | Can you clarify what you mean by "other threads"? How is libc++ supposed to detect and handle this problem? |
You'll have to guard it against all the platforms that don't support TLS. Darwin 10.6 is one of them.
Which is fine because we shouldn't supply a definition of __cxa_thread_atexit on those platforms.
Makes sense to me, I'll integrate it into the next revision.
src/cxa_thread_atexit.cpp | ||
---|---|---|
42 | I meant "non-main threads" ("other" is in relation to the bullet point above), but I can clarify this, sure. libc++ could be patched to do something like this: pthread_key_t key1, key2; void destructor1(void* ptr) { pthread_setspecific(key2, ptr); } void destructor2(void* ptr) { // Runs in the second iteration through pthread_key destructors, // therefore after thread_local destructors } pthread_key_create(&key1, destructor1); pthread_key_create(&key2, destructor2); pthread_setspecific(key1, ptr); (Or it could use a counter/flag and a single pthread_key.) libstdc++ has the same bug when __cxa_thread_atexit_impl() isn't available, so I'm not sure that change would really be necessary. If it is, I can write up the libc++ patch. |
src/cxa_thread_atexit.cpp | ||
---|---|---|
42 | I don't think we need to patch this in libc++. Especially because it would be incorrect in the vast majority of cases. |
LGTM modulo bug fix.
src/cxa_thread_atexit.cpp | ||
---|---|---|
70 | There is a bug here. If head->next == nullptr and if head->dtor(head->obj)) creates a TL variable in the destructor then that destructor will not be invoked. Here's an updated test case which catches the bug: https://gist.github.com/EricWF/3bb50d4f28b91aa28d2adefea0e94a0e |
src/cxa_thread_atexit.cpp | ||
---|---|---|
70 | I can't reproduce that failure here, your exact test case passes (even with #undef HAVE___CXA_THREAD_ATEXIT_IMPL and the weak symbol test commented out). Tracing the implementation logic, it seems correct. If head->next == nullptr then this line does dtors = nullptr. Then if head->dtor(head->obj) registers a new thread_local, __cxa_thread_atexit() does head = malloc(...); ... dtors = head;. Then the next iteration of the loop while (auto head = dtors) { picks up that new node. Have I missed something? |
src/cxa_thread_atexit.cpp | ||
---|---|---|
70 | I can't reproduce this morning either, I must have been doing something funny. I'll look at this with a fresh head tomorrow. If I can't find anything this will be good to go. Thanks for working on this. |
src/cxa_thread_atexit.cpp | ||
---|---|---|
70 | No problem! I can integrate your updated test case anyway if you want. |
LGTM after addressing the inline comments.
src/cxa_thread_atexit.cpp | ||
---|---|---|
70 | Yeah I would like to see the upgraded test case applied. At least that way we're testing the case in question. So I agree with your above analysis of what happens, and that all destructors are correctly called during the first iteration of pthread key destruction. My one issues is that we still register a new non-null key which forces pthread to run the destructor for the key again. I would like to see this fixed. |
Integrated @EricWF's expanded test case, and avoid an unneeded pthread_setspecific() call if the last thread_local's destructor initializes a new thread_local.
src/cxa_thread_atexit.cpp | ||
---|---|---|
70 | Yep, done! I guess that was the point of thread_alive from your original patch-to-my-patch, sorry for stripping it out. |
LGTM. Thanks for the patch.
src/cxa_thread_atexit.cpp | ||
---|---|---|
70 | Na I thought I was solving a bug that didn't exist. You give me too much credit. |
@tavianator: I'm about to commit D24864, which will affect this patch (indirectly). D24864 basically refactors all pthread dependencies behind a separate API. It would be pretty straightforward for you to update this patch though, just replacing pthread calls with ones in thread_support.h (perhaps adding anything missing).
Hope you don't mind me going first? If you are going to commit this soon, I can hold off D24864. Let me know.
Cheers,
/ Asiri
@rmaprath I'll merge this if needed. Feel free to commit your patch first.
@tavianator Do you need somebody to merge this for you?
Yeah, @rmaprath I'm happy to rebase this over your patch.
@tavianator Do you need somebody to merge this for you?
I assume so, yeah.
My patch got a bit stuck downstream :(
So you / @EricWF can go ahead with this patch, I'll rebase mine over yours when I'm ready to commit.
/ Asiri
I'm going to have to agree with @majnemer. I think that the config check for LIBCXXABI_HAS_CXA_THREAD_ATEXIT_IMPL should stay in place. If cxa_thread_atexit_impl exists, then all of the fallback code can disappear at preprocessing time.
We do lose out on the minor benefit of avoiding some libc++ recompiles, but we also avoid code bloat.
For what it's worth, I'm willing to keep the weak symbol check in place if __cxa_thread_atexit_impl isn't present, I just don't want to pay for the fallback when I know I'm not going to use it.