Implement stop_token
http://eel.is/c++draft/thread.stoptoken
Could you extract atomic_unique_lock.h, intrusive_list.h and intrusive_shared_ptr.h into a separate review? I think we can check those in soon and that will greatly simplify this review.

I don't know where to put them.

It's not great, but I'd suggest leaving them in __stop_token and we can think about reorganization later.
libcxx/include/__stop_token/atomic_unique_lock.h | ||
---|---|---|
1 ↗ | (On Diff #512113) | Since this is general purpose, it makes little sense to keep this under __stop_token/, but we also don't have any good place for these general-purpose utilities. No action item, but let's keep this at the back of our mind that we might need to reorganize things a bit in the future. |
4 ↗ | (On Diff #512113) | Can you add just a few simple tests for this class under libcxx/test/libcxx? |
40 ↗ | (On Diff #512113) | Would it make sense to use std::try_to_lock_t here to convey that we're taking the lock? Otherwise, this can be confusing with std::unique_lock. |
66 ↗ | (On Diff #512113) | Maybe we could call this __owns_lock() to mirror unique_lock? |
87 ↗ | (On Diff #512113) | I think a short comment explaining what __fail_to_lock and __state_after_lock are used for would help; it's not obvious from the names alone. Actually, maybe __fail_to_lock should be called __give_up_locking? __fail_to_lock sounds like something that is called in case you've failed to take the lock, which is not quite correct. |
117 ↗ | (On Diff #512113) | |
libcxx/include/__stop_token/intrusive_list.h | ||
1 ↗ | (On Diff #512113) | Can you add just a few simple tests for this class under libcxx/test/libcxx? |
31 ↗ | (On Diff #512113) | __empty()? Or __is_empty()? |
42–47 ↗ | (On Diff #512113) | I think this is fine, but you should add a comment explaining that it's fine not to set __front->__next_ = nullptr, since it's not part of the linked list anymore. |
57–59 ↗ | (On Diff #512113) | If we don't have a prev, then we are necessarily the head. I would instead change this to: if (...) { ... } else { _LIBCPP_ASSERT(__node == __head_, "message"); __pop_front(); } (a rough sketch of this shape follows the table). In other words, I think this function needs the precondition that __node is in the list, similar to how std::list requires the iterator to be from this list when you call erase(). I think that's a really useful and important model to have. |
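To make the suggested shape above concrete, here is a minimal, self-contained sketch. The type and member names (__intrusive_node, __list, __head_, __prev_, __next_) are illustrative only, and _LIBCPP_ASSERT is replaced by a plain assert:

```cpp
#include <cassert>

// Doubly linked intrusive list sketch; the precondition of __remove mirrors
// std::list::erase: __node must currently be an element of this list.
struct __intrusive_node {
  __intrusive_node* __prev_ = nullptr;
  __intrusive_node* __next_ = nullptr;
};

struct __list {
  __intrusive_node* __head_ = nullptr;

  void __pop_front() {
    __head_ = __head_->__next_;
    if (__head_)
      __head_->__prev_ = nullptr;
  }

  void __remove(__intrusive_node* __node) {
    if (__node->__prev_) {
      // Not the head: unlink it from its neighbours.
      __node->__prev_->__next_ = __node->__next_;
      if (__node->__next_)
        __node->__next_->__prev_ = __node->__prev_;
    } else {
      // No prev means __node must be the head, given the precondition.
      assert(__node == __head_ && "__remove called on a node that is not in the list");
      __pop_front();
    }
  }
};
```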
libcxx/include/__stop_token/intrusive_shared_ptr.h | ||
28–29 ↗ | (On Diff #512113) | Can you please add a comment documenting what this type is and what its contents should be for anyone specializing it? |
31–32 ↗ | (On Diff #512113) | Let's also add a short comment documenting this class, since it's general purpose in nature and we might reuse it in the future. |
35 ↗ | (On Diff #512113) | I'd make this explicit, like the one for std::shared_ptr. |
libcxx/include/__stop_token/stop_callback.h | ||
85 | Isn't this always nullptr since you default initialize it above? | |
libcxx/include/__stop_token/stop_state.h | ||
151 | ||
154 | I think we shouldn't call this __remove if we know that __cb is not in the list anymore, which can be checked via __cb->__prev_ == nullptr && !__callback_list_.__is_head(__cb). Oh and that's exactly the check you make above with __potentially_executing_now! And now I think we have the precondition we want for __remove. |
libcxx/include/__stop_token/stop_source.h | ||
---|---|---|
49 | I would add a comment like: // increment `other` first so that we don't hit 0 in case of self-assignment (a sketch of that pattern follows this table). | |
73–75 | I assume this is tested in stop_requested.pass.cpp? | |
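To spell out the self-assignment point from the comment on line 49 above, the pattern looks roughly like this. This is a generic, illustrative sketch only; none of these names come from the patch:

```cpp
#include <atomic>

struct __shared_state {
  std::atomic<int> __ref_count_{1};
  void __increment_ref_count() noexcept { __ref_count_.fetch_add(1, std::memory_order_relaxed); }
  void __decrement_ref_count() noexcept {
    if (__ref_count_.fetch_sub(1, std::memory_order_acq_rel) == 1)
      delete this;
  }
};

struct __handle {
  __shared_state* __state_ = nullptr;

  __handle& operator=(const __handle& __other) noexcept {
    // Increment `other` first so that we don't hit 0 in case of self-assignment:
    // if *this and __other share the same state, decrementing first could
    // destroy the state before we re-take a reference to it.
    if (__other.__state_)
      __other.__state_->__increment_ref_count();
    if (__state_)
      __state_->__decrement_ref_count();
    __state_ = __other.__state_;
    return *this;
  }
};
```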
libcxx/include/__stop_token/stop_state.h | ||
43–51 | ||
54–56 | ||
83 | ||
89–90 | ||
124 | Or something! | |
151 | ||
libcxx/test/std/thread/thread.stoptoken/stopcallback/cons.const.token.pass.cpp | ||
12–13 | You can replace this by // XFAIL: availability-synchronization_library-missing now. Here and everywhere else. | |
libcxx/test/std/thread/thread.stoptoken/stopcallback/dtor.pass.cpp | ||
156 | Maybe a bit more explicit? | |
158 | Is there any reason why you don't initialize sc_ directly with std::make_unique<>? That would make it clear that ptr is not actually meaningful other than as a temporary variable. |
libcxx/include/__stop_token/stop_state.h | ||
---|---|---|
35 | Here I think we do need to store a pointer to a bool saying whether we've been destroyed, to handle https://eel.is/c++draft/thread.stoptoken#stopcallback.cons-6 like we've discussed (a sketch of the pattern follows this table). |
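For readers following along, the "pointer to a bool" pattern referenced above is the usual way for the invoking code to notice that the callback destroyed its own stop_callback mid-invocation, which [stopcallback.cons]-6 requires not to block. A heavily simplified sketch with illustrative names; a real implementation also has to coordinate with a destructor running on another thread:

```cpp
struct __callback_node {
  void (*__fn_)(__callback_node*);
  bool* __destroyed_ = nullptr; // points into the invoker's stack while running

  ~__callback_node() {
    if (__destroyed_)
      *__destroyed_ = true; // tell the invoker this object no longer exists
  }

  void __invoke() {
    bool __destroyed = false;
    __destroyed_ = &__destroyed;
    __fn_(this); // may destroy *this; the destructor must not block in that case
    if (!__destroyed)
      __destroyed_ = nullptr; // only touch members if we still exist
  }
};
```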
libcxx/include/__stop_token/atomic_unique_lock.h | ||
---|---|---|
40 ↗ | (On Diff #512113) | std::try_to_lock_t implies non-blocking behaviour. Here we are still blocking, so it might not be appropriate. |
libcxx/include/__stop_token/stop_callback.h | ||
85 | __state is a function parameter, so it is not always nullptr.
I am extremely happy with the state of this patch! Thanks so much for all the iterations; I feel like we really have something good with great test coverage now. I'm especially happy about how things ended up after the intrusive shared pointer and other generic bits were extracted; I think the code is really neat now.
Once the remaining comments have been addressed and the CI is passing, feel free to ship this. Thank you!
libcxx/include/__stop_token/stop_callback.h | ||
---|---|---|
43 | Do you have a test for this typedef? | |
85 | Do you have a test for the fact that the stop_callback won't get shared ownership of the __stop_state if the constructor runs the callback immediately (i.e. if a stop has already been requested at the moment of calling this constructor)? Edit: I don't think there's a way to test that without poking into the library internals -- in other words, I'm not sure this is user-observable. | |
85–86 | I might combine these two ifs into if (__state && __state->__add_callback(this)). | |
libcxx/include/__stop_token/stop_source.h | ||
29 | Do you have a test for this? We need to make sure that:
| |
libcxx/include/__stop_token/stop_state.h | ||
2 | Recording the discussion: we considered whether we should move some of these classes into the dylib, and I think the answer should be no. There isn't *that much* code, it's not platform-specific (e.g. it doesn't need to include windows.h), and moving it would add backdeployment requirements. All in all, I don't think there's enough bang for the buck. | |
57 | ||
71 | Could we add a _LIBCPP_ASSERT here to make sure that we don't overflow? Same below for the decrement: if we decrement when we are already at 0, I guess it means that there's a lifetime bug in the program somewhere and someone's using a stop_source that has already been destroyed. Since that's easy to catch, I think we should add those assertions. | |
110 | Taking note of what you explained just now: since there's only one thread looping through the list and executing callbacks, it should be possible to store a single pointer to the currently-executing callback in the stop state. That would avoid requiring a __completed_ and a __destroyed_ member in the callback. Then in __remove_callback, we would atomic-wait on the currently-executing callback no longer being this (a rough sketch follows this table). I would suggest that we ship this patch with the current approach and investigate this improvement in a separate patch. | |
176 | ||
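A rough sketch of that alternative, purely for discussion; the names and the surrounding locking are illustrative, this is not what the patch implements, and a real version also has to handle a callback destroying its own stop_callback on the same thread:

```cpp
#include <atomic>

struct __state_sketch {
  // Published by the request_stop() thread while it runs a given callback.
  std::atomic<void*> __executing_callback_{nullptr};

  void __run_one(void* __cb, void (*__fn)(void*)) {
    __executing_callback_.store(__cb, std::memory_order_release);
    __fn(__cb);
    __executing_callback_.store(nullptr, std::memory_order_release);
    __executing_callback_.notify_all();
  }

  // Called from ~stop_callback() when __cb is no longer in the list: block
  // until the executing thread is done with it.
  void __wait_until_not_executing(void* __cb) {
    // atomic::wait can return spuriously, hence the re-check loop.
    while (__executing_callback_.load(std::memory_order_acquire) == __cb)
      __executing_callback_.wait(__cb, std::memory_order_acquire);
  }
};
```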
libcxx/include/__stop_token/stop_token.h | ||
53 | This private is redundant with the one above. | |
libcxx/test/std/thread/thread.stoptoken/stopcallback/cons.const.token.pass.cpp | ||
96 | Don't change anything, but we generally use snake_case for variable names. Just something to keep in mind for the future. | |
libcxx/test/std/thread/thread.stoptoken/stopsource/cons.copy.pass.cpp | ||
10 | This is duplicated with the line below (please double-check throughout all your tests). | |
13 | Here you're missing // XFAIL: availability-synchronization_library-missing, that would fix some of your CI. | |
libcxx/utils/generate_header_tests.py | ||
1 | To fix your CI issue, I think you need to add "stop_token": ["UNSUPPORTED: no-threads, availability-synchronization_library-missing"], to libcxx/utils/generate_feature_test_macro_components.py. |
Thanks for working on this. I have a few nits; otherwise LGTM!
libcxx/include/__stop_token/stop_callback.h | ||
---|---|---|
79 | ||
libcxx/include/__stop_token/stop_state.h | ||
35 | ||
53 | please search for other places too. | |
65 | ||
libcxx/test/std/thread/thread.stoptoken/nostopstate/cons.default.pass.cpp | ||
38 | How about testing the type? | |
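For reference, a sketch of what testing the type could look like, assuming only the standard-mandated declarations of std::nostopstate_t and std::nostopstate:

```cpp
#include <stop_token>
#include <type_traits>

// std::nostopstate is specified as an inline constexpr nostopstate_t object,
// and nostopstate_t has an explicit, defaulted default constructor.
static_assert(std::is_same_v<decltype(std::nostopstate), const std::nostopstate_t>);
static_assert(std::is_trivially_default_constructible_v<std::nostopstate_t>);
```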
libcxx/test/std/thread/thread.stoptoken/stopcallback/cons.const.token.pass.cpp | ||
62 | I really dislike using auto here. Either use a real type or the same_as. |
libcxx/include/__stop_token/stop_token.h | ||
---|---|---|
25 | I think my comment about fixing the backdeployment CI using availability-synchronization_library-missing was misguided. I'm not sure, but it seems like adding _LIBCPP_AVAILABILITY_SYNC here should tell the compiler that the class is only available when the synchronization library is available, and it should avoid getting an error when you merely #include <stop_token>. If you try to use std::stop_token, then it would fail with an error message but that's intended and it wouldn't impact the generated tests like modules_include.sh.cpp. |
Can you confirm whether this landed as b77e50e6aef5650d1ce0ce7ad1a41049395b9426? If so, let's close this.
Sorry for the belated feedback - I only just noticed that the stop_token implementation had been published.
The implementation looks like it's doing all the right things from a correctness point of view, which is good!
I have a few suggestions for changes that might help reduce the number of atomic ops needed for some operations, which you could consider if performance turns out to be a problem.
libcxx/include/__stop_token/stop_source.h | ||
---|---|---|
65 | The way that you have separated the ref-count from the stop_source_counter in the state_t means that destroying a stop_source object now needs to update two separate atomic variables, resulting in two atomic operations instead of one. | |
libcxx/include/__stop_token/stop_state.h | ||
59 | Initialising the ref-count to zero means that you need to do an atomic increment when the object is first constructed. Since you know that a state is only ever allocated by a stop_source, which will immediately take a reference, you could just initialise the ref-count to 1 here and have the stop_source constructor's intrusive_shared_ptr "adopt" that reference. Further, if you merged the ref-count and state_t atomics into a single 64-bit atomic, then you could reduce the destruction of a stop_source to a single atomic subtraction (a rough sketch of this layout follows the table). Making both of these changes would reduce the number of atomic ops executed when constructing a stop_source and immediately destructing it from 4 in the current implementation to just 1. | |
133 | Note that you don't need to lock again here if you know that this callback was the last callback in the list, since once you have set the "stop-requested" flag, no more callback items should be enqueued. This should save you an additional atomic lock/unlock at the cost of a branch here. | |
160 | If you were to combine the state_t and the ref-count you could potentially merge the ref-count increment with the unlock into a single atomic operation by adding (__ref_count_increment - _lock_bit) atomically (in the case where the stop_callback is being constructed with a const stop_token&). This would save an extra atomic op. Ideally the compiler would be able to merge consecutive atomic ops like this, but I don't think I've seen clang do such a thing before. |
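A rough sketch of the combined layout suggested above, for discussion only. The bit assignments, names, and widths are assumptions, not the patch's actual representation:

```cpp
#include <atomic>
#include <cstdint>

struct __combined_state_sketch {
  static constexpr uint64_t __lock_bit           = 1ull << 0;
  static constexpr uint64_t __stop_requested_bit = 1ull << 1;
  static constexpr uint64_t __source_count_one   = 1ull << 2;  // stop_source counter in bits [2, 33)
  static constexpr uint64_t __ref_count_one      = 1ull << 33; // ref-count in bits [33, 64)

  // Start with ref-count == 1 and one stop_source: the creating stop_source
  // "adopts" this reference instead of doing a separate increment.
  std::atomic<uint64_t> __state_{__ref_count_one + __source_count_one};

  // Destroying a stop_source drops both counters with a single atomic op.
  void __release_source_and_ref() noexcept {
    uint64_t __old = __state_.fetch_sub(__source_count_one + __ref_count_one,
                                        std::memory_order_acq_rel);
    if ((__old / __ref_count_one) == 1) // we held the last reference
      delete this;
  }

  // While the lock bit is held, clearing it and taking a reference can also be
  // merged into one atomic add (the stop_callback(const stop_token&) case).
  void __unlock_and_add_ref() noexcept {
    __state_.fetch_add(__ref_count_one - __lock_bit, std::memory_order_release);
  }
};
```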
Since we want to investigate some improvements, I would start by making stop_token experimental. Then we can add some benchmarks that faithfully represent the intended use cases for stop_token, not just some random benchmarks that test the performance of some arbitrary operations. @lewissbaker are you aware of any benchmarks that would be representative of the way stop_token intends to be used?
Then we can try a few things (in no order):
- Try to remove some atomic operations as laid out by @lewissbaker above
- Try implementing the lock with try { mutex.lock(); } catch (...) { ... }; instead and see if it's faster. We need the try-catch because the methods on stop_token are noexcept (a sketch of such a wrapper follows this list).
- Try to see if there's something obviously wrong with our atomic_unique_lock implementation. Maybe we're using compare_exchange_weak when we should be using strong, or maybe there's some other perf issue with our implementation.
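For the second bullet, such a wrapper could look roughly like this. This is only a sketch; what to do if lock() ever throws is an open question, and std::terminate() is merely a placeholder:

```cpp
#include <exception>
#include <mutex>

// A std::mutex-based lock that is safe to use from noexcept members.
class __noexcept_lock_guard {
  std::mutex& __mut_;

public:
  explicit __noexcept_lock_guard(std::mutex& __m) noexcept : __mut_(__m) {
    try {
      __mut_.lock();
    } catch (...) {
      std::terminate(); // placeholder: lock() throwing here is unrecoverable
    }
  }

  ~__noexcept_lock_guard() { __mut_.unlock(); }

  __noexcept_lock_guard(const __noexcept_lock_guard&)            = delete;
  __noexcept_lock_guard& operator=(const __noexcept_lock_guard&) = delete;
};
```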
I would also not eliminate the possibility that the current implementation actually performs quite well for the intended use cases of stop_token. But the benchmarks will give us more insight into that.
I wrote up a few basic tests/benchmarks for an older implementation of the design in cppcoro here:
https://github.com/lewissbaker/cppcoro/blob/master/test/cancellation_token_tests.cpp
Although I don't think these tests are necessarily representative of real-world usage.
There are a few use-cases that are going to be common for stop_tokens which could be good candidates for benchmarks:
- The std::jthread use-case.
In this case, you will have a single thread created by std::jthread consuming the stop_token:
- registering/deregistering callbacks, usually just one at a time (e.g. in the implementation of the condition_variable_any wait functions that take a stop_token), but sometimes registering one or more long-lived callbacks combined with single ephemeral callbacks.
- polling for a stop_request / stop_possible
- copying the stop_token (e.g. when passing by-value to nested function calls)
And then a main thread will eventually call request_stop() from the destructor of the std::jthread.
A rough sketch:

    #include <chrono>
    #include <cstdint>
    #include <iostream>
    #include <stop_token>
    #include <thread>

    int main() {
      using namespace std::chrono_literals;
      std::uint64_t count = 0;
      {
        std::jthread t([&](std::stop_token st) {
          std::uint64_t local_count = 0;
          while (!st.stop_requested()) {
            // Register and immediately deregister an ephemeral callback.
            std::stop_callback cb{st, [&]() noexcept {}};
            ++local_count;
          }
          count = local_count;
        });
        std::this_thread::sleep_for(30s);
      } // ~jthread() calls request_stop() and joins here.
      std::cout << "Thread did " << count << " callback registrations/deregistrations in 30s\n";
    }
If you also have an implementation of the condition_variable_any wait() overloads, then you could also try calling the stop_token-taking versions of those instead.
- Async use-case #1 - Shutdown stop_token
At app startup, the app creates a single stop_source and then passes an associated stop_token to every request.
Assume a thread-pool handles these requests and that, for each request, it polls stop_requested(), then attaches a stop-callback, does some work, and detaches the stop-callback some time later.
The lifetime of each request/callback would overlap with other requests/callbacks from the same thread.
Say something like each thread keeping a circular buffer of N stop-callbacks and destroying the stop-callbacks in FIFO order.
Rough sketch:

    #include <atomic>
    #include <chrono>
    #include <cstddef>
    #include <cstdint>
    #include <functional>
    #include <iostream>
    #include <numeric>
    #include <optional>
    #include <stop_token>
    #include <thread>
    #include <vector>

    constexpr size_t thread_count = 20;
    constexpr size_t concurrent_request_count = 1000;
    std::atomic<bool> ready{false};

    struct dummy_stop_callback {
      void operator()() const noexcept {}
    };

    void thread_func(std::uint64_t& count, std::stop_token st) {
      // Circular buffer of N stop-callbacks, destroyed in FIFO order.
      std::vector<std::optional<std::stop_callback<dummy_stop_callback>>> cbs(concurrent_request_count);
      ready.wait(false);
      std::uint32_t index = 0;
      std::uint64_t local_count = 0;
      while (!st.stop_requested()) {
        cbs[index].emplace(st, dummy_stop_callback{});
        index = (index + 1) % concurrent_request_count;
        ++local_count;
      }
      count = local_count;
    }

    int main() {
      std::vector<std::uint64_t> counts(thread_count, 0);
      std::stop_source ss;
      {
        std::vector<std::jthread> threads;
        {
          // on_scope_exit: assumed scope-guard helper (not shown) that runs the
          // lambda when this inner scope is left, releasing the waiting threads.
          auto release_on_exit = on_scope_exit([&]() {
            ready = true;
            ready.notify_all();
          });
          for (size_t i = 0; i < thread_count; ++i) {
            // std::ref so each thread writes into its own slot of `counts`.
            threads.emplace_back(thread_func, std::ref(counts[i]), ss.get_token());
          }
        }
        std::this_thread::sleep_for(std::chrono::seconds(10));
        ss.request_stop();
      }
      std::uint64_t total_count = std::reduce(counts.begin(), counts.end());
      std::cout << "Total iterations of " << thread_count << " threads for 10s was " << total_count << "\n";
    }
Variations on this might involve deregistering stop-callbacks on different threads than the threads they were registered on.
But this is more involved to set up.
- Async use-case #2 - Per-request stop_source
This is a variation on the previous use-case, but where each task ends up creating a bunch of different chained stop_sources to simulate the concurrent sub-task nature of some operations.
e.g. for each request, we create a new stop_source, A, and then attach a stop_callback to the global shutdown stop-token which then calls request_stop() on the new stop_source if called.
Then we simulate passing A's stop_token to two child operations, each of which creates its own stop_source (B and C) and subscribes a stop_callback to A that calls request_stop() on B or C respectively.
This can be done to an arbitrary number of ops per request, with varying tree widths/depths.
This should help to evaluate the performance of creation/destruction of stop_source and non-contended registration of stop-callbacks with less of a focus on lots of threads hitting the same stop-state.
You could also try a variation on this without the top-level shutdown stop_callback.
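A rough sketch of one request in this chained setup; the helper types (forward_stop_request, chained_source) and the handle_one_request function are illustrative names, not part of any proposal:

```cpp
#include <stop_token>

// Forwards a stop request from a parent token to a child stop_source.
struct forward_stop_request {
  std::stop_source* target;
  void operator()() const noexcept { target->request_stop(); }
};

// A stop_source plus the callback that chains it to its parent token.
struct chained_source {
  std::stop_source source;
  std::stop_callback<forward_stop_request> forward;

  explicit chained_source(std::stop_token parent)
      : source(), forward(std::move(parent), forward_stop_request{&source}) {}
};

void handle_one_request(const std::stop_token& shutdown_token) {
  chained_source a(shutdown_token);       // per-request stop_source A
  chained_source b(a.source.get_token()); // child sub-task source B
  chained_source c(a.source.get_token()); // child sub-task source C
  // The simulated sub-tasks would poll b's and c's tokens and register their
  // own ephemeral stop_callbacks here.
}
```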
It is worth noting, however, that with P2300 there is also a proposed std::in_place_stop_source. It provides an alternative that does not heap-allocate or ref-count, but the caller is required to keep the stop-source alive until all child operations that use its stop-tokens are done with it, so I'd expect a lot of the structured async use-cases to eventually use that where possible instead of std::stop_source. This in-place version was not appropriate for jthread, since jthread needs its stop-source to be movable.
Thanks a lot, this is great! @huixie90 I think this is something we can work with -- let's make this experimental, add a few benchmarks as laid out above, and then iterate on that!
I created https://github.com/llvm/llvm-project/issues/63738 to track these optimizations. Let's close this patch since it has been shipped in b77e50e6aef5650d1ce0ce7ad1a41049395b9426