This is an archive of the discontinued LLVM Phabricator instance.

[libcxxabi] Provide a fallback __cxa_thread_atexit() implementation
ClosedPublic

Authored by tavianator on Jun 28 2016, 10:49 AM.

Details

Summary

__cxa_thread_atexit_impl() isn't present on all platforms, for example
Android pre-6.0. This patch uses a weak symbol to detect _impl()
support, falling back to a pthread_key_t-based implementation.

Diff Detail

Event Timeline

tavianator retitled this revision from to [libcxxabi] Provide a fallback __cxa_thread_atexit() implementation.
tavianator updated this object.
tavianator added reviewers: danalbert, jroelofs.
tavianator added a subscriber: cfe-commits.
majnemer added inline comments.
src/cxa_thread_atexit.cpp
18–142

Is there a reason to use the weak symbol on non-Android platforms?

36–47

What happens if pthread_key_create fails?

tavianator added inline comments.Jun 28 2016, 6:52 PM
src/cxa_thread_atexit.cpp
18–142

Just simplicity. (There are some fringe benefits, like the ability to build against pre-2.18 glibc but have it detect ..._impl() support when run with newer libraries. Or magically supporting musl etc. if they add it, without having to recompile.)

Would you rather keep the build-time checks for non-Android platforms?

36–47

Nothing good! I'll add the necessary error handling.

tavianator updated this revision to Diff 62218.Jun 29 2016, 6:36 AM

Added error handling for pthread_key uses

You should look at __thread_specific_ptr in libcxx's <thread>. It does a lot of these things in order to satisfy the requirements of notify_all_at_thread_exit, set_value_at_thread_exit, and make_ready_at_thread_exit.

@rmaprath has been doing some work to make the threading runtime library swappable. I don't recall if his work extended to libcxxabi or not, but I'll page him anyway.

This implementation of __cxa_thread_atexit doesn't interact nicely with shared libraries. The following sequence of events causes unloaded code to get invoked.

  • Create thread 42
  • Load library foo from thread 42
  • Call a function with a thread_local object with a dtor.
  • Unload library foo.
  • Join thread 42

glibc does some extra work during __cxa_thread_atexit_impl to bump the reference count of the shared library so that the user's "unload" doesn't actually unload the library.

src/cxa_thread_atexit.cpp
40

I think this is correct, but it needs some comments because it is not obvious what (or why) this is implemented this way.

More specifically, document the cases where run_dtors is run because of ~DtorListHolder vs. the cases where run_dtors is run because of the callback registered at pthread_key_create.

You should look at __thread_specific_ptr in libcxx's <thread>. It does a lot of these things in order to satisfy the requirements of notify_all_at_thread_exit, set_value_at_thread_exit, and make_ready_at_thread_exit.

Had a look at it. One thing that stands out is that notify_all_at_thread_exit() and friends are supposed to be invoked *after* thread_local destructors. But the order that pthread_key destructors run in is unspecified. This could be worked around by waiting for the second iteration through pthread_key destructors before triggering ~__thread_struct_imp(). It looks like libstdc++ has a similar bug if ..._impl() isn't available.

@rmaprath has been doing some work to make the threading runtime library swappable. I don't recall if his work extended to libcxxabi or not, but I'll page him anyway.

<__threading_support>? Seems to be libc++-specific. There's a few other raw uses of pthreads in libc++abi.

This implementation of __cxa_thread_atexit doesn't interact nicely with shared libraries. The following sequence of events causes unloaded code to get invoked.

  • Create thread 42
  • Load library foo from thread 42
  • Call a function with a thread_local object with a dtor.
  • Unload library foo.
  • Join thread 42

glibc does some extra work during __cxa_thread_atexit_impl to bump the reference count of the shared library so that the user's "unload" doesn't actually unload the library.

Yep. This is about as good as libc++abi can do on its own though. Note that libstdc++ has similar limitations if ..._impl() isn't available.

You should look at __thread_specific_ptr in libcxx's <thread>. It does a lot of these things in order to satisfy the requirements of notify_all_at_thread_exit, set_value_at_thread_exit, and make_ready_at_thread_exit.

Had a look at it. One thing that stands out is that notify_all_at_thread_exit() and friends are supposed to be invoked *after* thread_local destructors. But the order that pthread_key destructors run in is unspecified. This could be worked around by waiting for the second iteration through pthread_key destructors before triggering ~__thread_struct_imp(). It looks like libstdc++ has a similar bug if ..._impl() isn't available.

It also intentionally leaks the pthread key. Does the __thread_specific_ptr rationale hold for this change as well?

This implementation of __cxa_thread_atexit doesn't interact nicely with shared libraries. The following sequence of events causes unloaded code to get invoked.

  • Create thread 42
  • Load library foo from thread 42
  • Call a function with a thread_local object with a dtor.
  • Unload library foo.
  • Join thread 42

glibc does some extra work during __cxa_thread_atexit_impl to bump the reference count of the shared library so that the user's "unload" doesn't actually unload the library.

Yep. This is about as good as libc++abi can do on its own though. Note that libstdc++ has similar limitations if ..._impl() isn't available.

I was going to tell you that this is implementable with dladdr (which I think Android has). Then I looked more at the "prevent unloading" side of things, and it looks like that requires digging into the library structures directly. Ugh.

Comment on the limitation in the source, but you don't need to change any code for this item.

@rmaprath has been doing some work to make the threading runtime library swappable. I don't recall if his work extended to libcxxabi or not, but I'll page him anyway.

<__threading_support>? Seems to be libc++-specific. There's a few other raw uses of pthreads in libc++abi.

Plan to get started with libc++abi as soon as I'm done with libc++, I've hit a couple of bumps with the latter, resolving them at the moment.

I don't see any new pthread dependencies introduced in this patch, so it sounds OK to me. Thanks for the ping though!

The patch itself looks OK to me. Will need approval from @mclow.lists or @EricWF.

majnemer added inline comments.Jun 29 2016, 10:59 AM
src/cxa_thread_atexit.cpp
18–142

I think that this should be an opt-in mechanism, there are platforms that presumably never need to pay the cost of the unused code (macOS comes to mind).

It also intentionally leaks the pthread key. Does the __thread_specific_ptr rationale hold for this change as well?

Hmm, maybe? If other global destructors run after ~DtorListHolder(), and they cause a thread_local to be initialized for the first time, __cxa_thread_atexit() might be called again. I was thinking that dtors would get re-initialized in that case but it appears it does not. So yeah, I think I'll need to leak the pthread_key_t.

I'm not sure how to avoid leaking the actual thread_local objects that get created in that situation. There's nothing left to trigger run_dtors() a second time.

src/cxa_thread_atexit.cpp
18–142

This file is only built for UNIX AND NOT (APPLE OR CYGWIN). Other platforms use something other than __cxa_thread_atexit() I assume.

bcraig added a comment.EditedJun 29 2016, 1:06 PM

Hmm, maybe? If other global destructors run after ~DtorListHolder(), and they cause a thread_local to be initialized for the first time, __cxa_thread_atexit() might be called again. I was thinking that dtors would get re-initialized in that case but it appears it does not. So yeah, I think I'll need to leak the pthread_key_t.

I'm not sure how to avoid leaking the actual thread_local objects that get created in that situation. There's nothing left to trigger run_dtors() a second time.

I'm not concerned about the loss of memory or pthread_key resources in this leak, as it is a very short-lived leak (the process is going away after all). We do need to have an idea of what happens with the destructor invocations for the other kinds of resources though.

I think the C++14 spec says what should happen.

3.6.3 Termination

  1. [...] The completions

of the destructors for all initialized objects with thread storage duration within that thread are sequenced
before the initiation of the destructors of any object with static storage duration. If the completion of the
constructor or dynamic initialization of an object with thread storage duration is sequenced before that of
another, the completion of the destructor of the second is sequenced before the initiation of the destructor
of the first. If the completion of the constructor or dynamic initialization of an object with static storage
duration is sequenced before that of another, the completion of the destructor of the second is sequenced
before the initiation of the destructor of the first.

What that means for this implementation is that I think that _cxa_thread_atexit is allowed to be called during run_dtors. If running the dtor for a thread local variable 'cat', we encounter a previously unseen thread_local 'dog', the compiler will call the ctor, then register the dtor with _cxa_thread_atexit. Since it is the most recently constructed thread local object, I would expect the 'dog' dtor to be the next dtor to be run. You may be able to support this just by moving "elem = elem->next" below the dtor invocation.

joerg added a subscriber: joerg.Jun 29 2016, 2:40 PM

On the topic of __cxa_thread_atexit, was it ever specified how it interacts with things like thread cancellation?

dimitry added inline comments.
src/cxa_thread_atexit.cpp
45

run_dtors() is called when/if libc++.so gets unloaded... but only for the thread calling dlclose()?

On the topic of __cxa_thread_atexit, was it ever specified how it interacts with things like thread cancellation?

I don't think it's officially specified anywhere. C++ threads don't have a cancel method. The POSIX spec doesn't speak about the C++ ABI. The Itanium ABI could talk about this, but hasn't yet.

I think this implementation does the right thing with regards to cancellation though. POSIX says that first cancellation cleanup handlers are called, then thread-specific data destructors are called. pthread_cancel is still a really bad idea due to how it (doesn't) interact with RAII, but at least TLS data won't get leaked.

src/cxa_thread_atexit.cpp
45

Most of the dtor magic is on the pthread_key_create side. pthreads lets you register a per-thread destructor.

This destructor is only run on process termination (I think).

dimitry added inline comments.Jun 30 2016, 9:12 AM
src/cxa_thread_atexit.cpp
45

I meant the call from ~DtorListHolder()

Fixed some corner cases regarding destruction order and very-late-initialized thread_locals. Explicitly documented the known limitations compared to __cxa_thread_atexit_impl().

tavianator marked 3 inline comments as done.Jun 30 2016, 11:34 AM

What that means for this implementation is that I think that _cxa_thread_atexit is allowed to be called during run_dtors. If running the dtor for a thread local variable 'cat', we encounter a previously unseen thread_local 'dog', the compiler will call the ctor, then register the dtor with _cxa_thread_atexit. Since it is the most recently constructed thread local object, I would expect the 'dog' dtor to be the next dtor to be run. You may be able to support this just by moving "elem = elem->next" below the dtor invocation.

It wasn't quite that easy (have to re-look at the pthread_key to get newly added thread_locals), but that's done in the latest patch.

src/cxa_thread_atexit.cpp
38

See http://stackoverflow.com/q/38130185/502399 for a test case that would trigger this. This may not be necessary depending on the answer to that question.

45

This has changed somewhat in the latest patch, but the gist is similar. If libc++abi.so is dlclose()d, there had better not be any still-running threads that expect to execute thread_local destructors (or any other C++ code, for that matter). In the usual case (libc++abi.so loaded at startup, not by a later dlopen()), the last run_dtors() call happens as the final thread is exiting.

Added missing __dso_handle declaration.

Fix copy-pasta that result in an infinite loop.

bcraig added inline comments.Jun 30 2016, 12:57 PM
src/cxa_thread_atexit.cpp
47

Why are we doing this? I can see it being a little useful when debugging / developing, so that you get an early warning that something has gone wrong, but it seems like this will always be setting a value to the value it already has.

54

Maybe this concern is unfounded, but I'm not overly fond of pthread_getspecific and setspecific in a loop. I've always been under the impression that those functions are rather slow. Could we add a layer of indirection so that we don't need to call getspecific and setspecific so often? Basically make the pointer that is directly stored in TLS an immutable pointer to pointer.

tavianator added inline comments.Jun 30 2016, 1:17 PM
src/cxa_thread_atexit.cpp
47

pthread_key destructors run after the key is set to null. I re-set it here since the loop reads the key.

54

Sure, I can do that. Would reduce the number of setspecific() calls in __cxa_thread_atexit too.

Also, can you add test cases for a lot of these things? I don't expect test cases for the DSO side of things, but a lot of the tricky atexit cases should be covered.

tavianator updated this revision to Diff 62691.Jul 4 2016, 9:26 AM

Added a test case for destructor ordering. Got rid of pthread_{get,set}specific in a loop.

bcraig added inline comments.Jul 5 2016, 9:24 AM
src/cxa_thread_atexit.cpp
23

I'm going to have to agree with @majnemer. I think that the config check for LIBCXXABI_HAS_CXA_THREAD_ATEXIT_IMPL should stay in place. If cxa_thread_atexit_impl exists, then all of the fallback code can disappear at preprocessing time.

We do lose out on the minor benefit of avoiding some libc++ recompiles, but we also avoid code bloat.

For what it's worth, I'm willing to keep the weak symbol check in place if __cxa_thread_atexit_impl isn't present, I just don't want to pay for the fallback when I know I'm not going to use it.

47

The loop doesn't read pthread_getspecific anymore. I get the need for the setspecific call here for your previous design, but I don't think it's needed now.

test/thread_local_destruction_order.pass.cpp
2

Nit: file name is wrong here.

49

Can we have a CreatesThreadLocalInDestructor in the thread_fn as well? That way we can test both the main function and a pthread. If I understand your code and comments correctly, those go through different code paths.

55

In the places where you can, validate that dtors actually are getting called. This may be your only place where you can do that.

So something like 'assert(seq == 1)' here.

tavianator added inline comments.Jul 5 2016, 12:13 PM
src/cxa_thread_atexit.cpp
23

Makes sense, I'll do that.

47

__cxa_thread_atexit() calls pthread_getspecific(), so it's still needed. Otherwise it would create a new list instead of adding to the current one, and the ordering would be wrong.

test/thread_local_destruction_order.pass.cpp
49

Yep, meant to do that actually!

55

Sounds good. What I wanted to do was print some output in the destructors, and check for a certain expected output. But I couldn't figure out how to do that with lit.

bcraig added inline comments.Jul 5 2016, 12:23 PM
test/thread_local_destruction_order.pass.cpp
55

Normally, you would do that by piping the output to the llvm FileCheck utility. My unconfirmed suspicion is that that approach will run into difficulties because of libcxx and libcxxabi specific setups.

I think just having the global object's dtor check is good enough for the final post condition though. I'm not terribly worried about global dtors malfunctioning.

tavianator updated this revision to Diff 63232.Jul 8 2016, 8:31 AM
tavianator marked 6 inline comments as done.
  • Bring back HAVE___CXA_THREAD_ATEXIT_IMPL, and avoid the weak symbol/fallback implementation in that case
  • Fix a leak in an error path
  • Add a CreatesThreadLocalInDestructor to a non-main thread in the destructor ordering test
tavianator marked 9 inline comments as done.Jul 8 2016, 8:36 AM

LGTM (with a comment nit), but you'll need to get approval from @EricWF or @mclow.lists.

I would like some of the information from your stack overflow post to make it's way to the comments. In particular, I think I would like to see it documented that we have made a choice for some undefined behavior.

tavianator updated this revision to Diff 63801.Jul 13 2016, 6:54 AM

Update comments to mention that late-initialized thread_locals invoke undefined behavior.

tavianator updated this revision to Diff 63806.Jul 13 2016, 8:17 AM

Make sure the tail of the list is null.

Anything else I need to do for this patch?

Ping?

Well, I still think it's fine. Maybe a direct message to @mclow.lists or @EricWF?

Ping?

Well, I still think it's fine. Maybe a direct message to @mclow.lists or @EricWF?

Is there a way to do that through Phabricator? Or did you mean to email them directly? (Not sure what their emails are but I can probably figure it out from the list history.)

bcraig added a comment.Sep 1 2016, 1:33 PM

The "@" will do a ping through phabricator, but a direct email is probably going to be your best bet at this point.

EricWF edited edge metadata.Sep 1 2016, 1:34 PM

I'll look at this within the hour.

EricWF added a comment.Sep 1 2016, 4:56 PM

We can perform far fewer calls to pthread_getspecific/pthread_setspecific if we represent the list head using a global __thread DtorList* list_head = nullptr.
This also allows us to avoid the hack of setting/unsetting the key during run_dtors() which I really do not like.

Here is a patch that applies such changes: https://gist.github.com/EricWF/a071376b1216aabdd1695eec2175c374

What do you think of this idea?

src/cxa_thread_atexit.cpp
42

Can you clarify what you mean by "other threads"?

How is libc++ supposed to detect and handle this problem?

jroelofs edited edge metadata.Sep 1 2016, 5:07 PM

__thread

What do you think of this idea?

You'll have to guard it against all the platforms that don't support TLS. Darwin 10.6 is one of them.

EricWF added a comment.EditedSep 1 2016, 5:15 PM

__thread

What do you think of this idea?

You'll have to guard it against all the platforms that don't support TLS. Darwin 10.6 is one of them.

Which is fine because we shouldn't supply a definition of __cxa_thread_atexit on those platforms.

__thread

What do you think of this idea?

You'll have to guard it against all the platforms that don't support TLS. Darwin 10.6 is one of them.

Which is fine because we shouldn't supply a definition of __cxa_thread_atexit on those platforms.

Ah, ok. Sounds good.

__thread

What do you think of this idea?

Makes sense to me, I'll integrate it into the next revision.

src/cxa_thread_atexit.cpp
42

I meant "non-main threads" ("other" is in relation to the bullet point above), but I can clarify this, sure.

libc++ could be patched to do something like this:

pthread_key_t key1, key2;

void destructor1(void* ptr) {
  pthread_setspecific(key2, ptr);
}

void destructor2(void* ptr) {
  // Runs in the second iteration through pthread_key destructors,
  // therefore after thread_local destructors
}

pthread_key_create(&key1, destructor1);
pthread_key_create(&key2, destructor2);

pthread_setspecific(key1, ptr);

(Or it could use a counter/flag and a single pthread_key.)

libstdc++ has the same bug when __cxa_thread_atexit_impl() isn't available, so I'm not sure that change would really be necessary. If it is, I can write up the libc++ patch.

EricWF added inline comments.Sep 2 2016, 11:36 AM
src/cxa_thread_atexit.cpp
42

I don't think we need to patch this in libc++. Especially because it would be incorrect in the vast majority of cases.

tavianator updated this revision to Diff 70406.Sep 6 2016, 8:52 AM
tavianator edited edge metadata.

Uses a __thread variable to hold the destructor list, as @EricWF suggested.

EricWF added a comment.Sep 6 2016, 5:45 PM

LGTM modulo bug fix.

src/cxa_thread_atexit.cpp
70

There is a bug here. If head->next == nullptr and if head->dtor(head->obj)) creates a TL variable in the destructor then that destructor will not be invoked.

Here's an updated test case which catches the bug: https://gist.github.com/EricWF/3bb50d4f28b91aa28d2adefea0e94a0e

tavianator added inline comments.Sep 7 2016, 7:04 AM
src/cxa_thread_atexit.cpp
70

I can't reproduce that failure here, your exact test case passes (even with #undef HAVE___CXA_THREAD_ATEXIT_IMPL and the weak symbol test commented out).

Tracing the implementation logic, it seems correct. If head->next == nullptr then this line does dtors = nullptr. Then if head->dtor(head->obj) registers a new thread_local, __cxa_thread_atexit() does head = malloc(...); ... dtors = head;. Then the next iteration of the loop while (auto head = dtors) { picks up that new node.

Have I missed something?

EricWF added inline comments.Sep 7 2016, 2:03 PM
src/cxa_thread_atexit.cpp
70

I can't reproduce this morning either, I must have been doing something funny. I'll look at this with a fresh head tomorrow. If I can't find anything this will be good to go. Thanks for working on this.

tavianator added inline comments.Sep 8 2016, 1:18 PM
src/cxa_thread_atexit.cpp
70

No problem! I can integrate your updated test case anyway if you want.

LGTM after addressing the inline comments.

src/cxa_thread_atexit.cpp
70

Yeah I would like to see the upgraded test case applied. At least that way we're testing the case in question.

So I agree with your above analysis of what happens, and that all destructors are correctly called during the first iteration of pthread key destruction. My one issues is that we still register a new non-null key which forces pthread to run the destructor for the key again. I would like to see this fixed.

tavianator updated this revision to Diff 71508.Sep 15 2016, 8:24 AM

Integrated @EricWF's expanded test case, and avoid an unneeded pthread_setspecific() call if the last thread_local's destructor initializes a new thread_local.

tavianator marked 5 inline comments as done.Sep 15 2016, 8:27 AM
tavianator added inline comments.
src/cxa_thread_atexit.cpp
70

Yep, done! I guess that was the point of thread_alive from your original patch-to-my-patch, sorry for stripping it out.

EricWF accepted this revision.Sep 15 2016, 12:05 PM
EricWF edited edge metadata.

LGTM. Thanks for the patch.

src/cxa_thread_atexit.cpp
70

Na I thought I was solving a bug that didn't exist. You give me too much credit.

This revision is now accepted and ready to land.Sep 15 2016, 12:05 PM
tavianator updated this revision to Diff 71934.Sep 20 2016, 8:07 AM
tavianator edited edge metadata.
tavianator marked an inline comment as done.

s/indended/intended/

@tavianator: I'm about to commit D24864, which will affect this patch (indirectly). D24864 basically refactors all pthread dependencies behind a separate API. It would be pretty straightforward for you to update this patch though, just replacing pthread calls with ones in thread_support.h (perhaps adding anything missing).

Hope you don't mind me going first? If you are going to commit this soon, I can hold off D24864. Let me know.

Cheers,

/ Asiri

@rmaprath I'll merge this if needed. Feel free to commit your patch first.

@tavianator Do you need somebody to merge this for you?

@rmaprath I'll merge this if needed. Feel free to commit your patch first.

Yeah, @rmaprath I'm happy to rebase this over your patch.

@tavianator Do you need somebody to merge this for you?

I assume so, yeah.

@rmaprath I'll merge this if needed. Feel free to commit your patch first.

Yeah, @rmaprath I'm happy to rebase this over your patch.

My patch got a bit stuck downstream :(

So you / @EricWF can go ahead with this patch, I'll rebase mine over yours when I'm ready to commit.

/ Asiri

EricWF closed this revision.Oct 12 2016, 2:03 AM

Committed as r283988. Thanks for the patch!