Implement stop_token
http://eel.is/c++draft/thread.stoptoken
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
Could you extract atomic_unique_lock.h, intrusive_list.h and intrusive_shared_ptr.h into a separate review? I think we can check those in soon and that will greatly simplify this review. I don't know where to put them. It's not great, but I'd suggest leaving them in __stop_token and we can think about reorganization later.
libcxx/include/__stop_token/atomic_unique_lock.h | ||
---|---|---|
2 | Since this is general purpose, it makes little sense to keep this under __stop_token/, but we also don't have any good place for these general-purpose utilities. No action item, but let's keep this at the back of our mind that we might need to reorganize things a bit in the future. | |
5 | Can you add just a few simple tests for this class under libcxx/test/libcxx? | |
41 | Would it make sense to use std::try_to_lock_t here to convey that we're taking the lock? Otherwise, this can be confusing with std::unique_lock. | |
67 | Maybe we could call this __owns_lock() to mirror unique_lock? | |
88 | I think a short comment explaining what __fail_to_lock and __state_after_lock are used for, it's not obvious from the names alone. Actually, maybe __fail_to_lock should be called __give_up_locking? __fail_to_lock can sound like something that is called in case you've failed to take the lock, which is not quite correct. | |
118 | ||
libcxx/include/__stop_token/intrusive_list.h | ||
2 | Can you add just a few simple tests for this class under libcxx/test/libcxx? | |
32 | __empty()? Or __is_empty()? | |
43–48 | I think this is fine, but you should add a comment explaining that it's fine not to set __front->__next_ = nullptr, since it's not part of the linked list anymore. | |
58–60 | If we don't have a prev, then we are necessarily the head. I would instead change this to: if (...) { ... } else { _LIBCPP_ASSERT(__node == __head_, "message"); __pop_front(); } IOW, I think this function needs the precondition that __node is in the list, similarly to how std::list requires the iterator to be from this list when you call erase(). I think that's a really useful and important model to have. | |
libcxx/include/__stop_token/intrusive_shared_ptr.h | ||
28–29 ↗ | (On Diff #512113) | Can you please add a comment documenting what this type is and what its contents should be for anyone specializing it? |
31–32 ↗ | (On Diff #512113) | Let's also add a short comment documenting this class, since it's general purpose in nature and we might reuse it in the future. |
35 ↗ | (On Diff #512113) | I'd make this explicit, like the one for std::shared_ptr. |
libcxx/include/__stop_token/stop_callback.h | ||
85 | Isn't this always nullptr since you default initialize it above? | |
libcxx/include/__stop_token/stop_state.h | ||
151 | ||
154 | I think we shouldn't call this __remove if we know that __cb is not in the list anymore, which can be checked via __cb->__prev_ == nullptr && !__callback_list_.__is_head(__cb). Oh and that's exactly the check you make above with __potentially_executing_now! And now I think we have the precondition we want for __remove. |
libcxx/include/__stop_token/stop_source.h | ||
---|---|---|
49 | I would add a comment like: // increment `other` first so that we don't hit 0 in case of self-assignment | |
73–75 |
I assume this is tested in stop_requested.pass.cpp? | |
libcxx/include/__stop_token/stop_state.h | ||
43–51 | ||
54–56 | ||
83 | ||
89–90 | ||
124 | Or something! | |
151 | ||
libcxx/test/std/thread/thread.stoptoken/stopcallback/cons.const.token.pass.cpp | ||
12–13 | You can replace this by // XFAIL: availability-synchronization_library-missing now. Here and everywhere else. | |
libcxx/test/std/thread/thread.stoptoken/stopcallback/dtor.pass.cpp | ||
156 | Maybe a bit more explicit? | |
158 | Is there any reason why you don't initialize sc_ directly with std::make_unique<>? That would make it clear that ptr is not actually meaningful other than a temporary variable. |
libcxx/include/__stop_token/stop_state.h | ||
---|---|---|
35 | Here I think we do need to store a pointer to a bool saying whether we've been destroyed to handle https://eel.is/c++draft/thread.stoptoken#stopcallback.cons-6 like we've discussed. |
I am extremely happy with the state of this patch! Thanks so much for all the iterations, I feel like we really have something good with great test coverage now. I'm especially happy about how things ended up after intrusive shared ptr and other generic bits were extracted, I think the code is really neat now.
Once the remaining comments have been addressed and the CI is passing, feel free to ship this. Thank you!
libcxx/include/__stop_token/stop_callback.h | ||
---|---|---|
43 | Do you have a test for this typedef? | |
85 | Do you have a test for the fact that the stop_callback won't get shared ownership of the __stop_state if the constructor runs the callback immediately (aka if a stop has been requested at the moment of calling this constructor). Edit: I don't think there's a way to test that without poking into the library internals -- IOW I'm not sure this is user-observable. | |
85–86 | I might combine these two ifs into if (__state && __state->__add_callback(this)). | |
libcxx/include/__stop_token/stop_source.h | ||
29 | Do you have a test for this? We need to make sure that:
| |
libcxx/include/__stop_token/stop_state.h | ||
2 | Recording the discussion: We considered whether we should move some of these classes to the dylib, and I think the answer should be no. There isn't *that much* code, it's not platform-specific (e.g. no including windows.h), and it would add backdeployment requirements. All in all, I don't think there's enough bang for the buck. | |
57 | ||
71 | Could we add a _LIBCPP_ASSERT here to make sure that we don't overflow? Same below for the decrement: if we decrement when we are already at 0, I guess it means that there's a lifetime bug in the program somewhere and someone's using a stop_source that has already been destroyed. Since that's easy to catch, I think we should add those assertions. | |
110 | Taking note of what you explained just now: Since there's only one thread looping through the list and executing callbacks, it should be possible to store a single pointer to the currently-executing callback in the stop state. That would avoid requiring a __completed_ and a __destroyed_ member in the callback. Then in __remove_callback, we would atomic-wait on the currently-executing callback not being this. I would suggest that we ship this patch with the current approach, and we investigate this improvement in a separate patch. | |
176 | ||
libcxx/include/__stop_token/stop_token.h | ||
53 | This private is redundant with the one above. | |
libcxx/test/std/thread/thread.stoptoken/stopcallback/cons.const.token.pass.cpp | ||
96 | Don't change anything, but we generally use snake_case for variable names. Just in the future. | |
libcxx/test/std/thread/thread.stoptoken/stopsource/cons.copy.pass.cpp | ||
10 | This is duplicated with the line below (please double-check throughout all your tests). | |
13 | Here you're missing // XFAIL: availability-synchronization_library-missing, that would fix some of your CI. | |
libcxx/utils/generate_header_tests.py | ||
1 ↗ | (On Diff #523770) | To fix your CI issue, I think you need to add "stop_token": ["UNSUPPORTED: no-threads, availability-synchronization_library-missing"], to libcxx/utils/generate_feature_test_macro_components.py. |
Thanks for working on this. I've a few nits, otherwise LGTM!
libcxx/include/__stop_token/stop_callback.h | ||
---|---|---|
79 | ||
libcxx/include/__stop_token/stop_state.h | ||
35 | ||
53 | please search for other places too. | |
65 | ||
libcxx/test/std/thread/thread.stoptoken/nostopstate/cons.default.pass.cpp | ||
37 ↗ | (On Diff #524441) | How about testing the type? |
libcxx/test/std/thread/thread.stoptoken/stopcallback/cons.const.token.pass.cpp | ||
62 | I really dislike using auto here. Either use a real type or the same_as. |
libcxx/include/__stop_token/stop_token.h | ||
---|---|---|
26 | I think my comment about fixing the backdeployment CI using availability-synchronization_library-missing was misguided. I'm not sure, but it seems like adding _LIBCPP_AVAILABILITY_SYNC here should tell the compiler that the class is only available when the synchronization library is available, and it should avoid getting an error when you merely #include <stop_token>. If you try to use std::stop_token, then it would fail with an error message but that's intended and it wouldn't impact the generated tests like modules_include.sh.cpp. |
Can you confirm whether this landed as b77e50e6aef5650d1ce0ce7ad1a41049395b9426? If so, let's close this.
Sorry for the belated feedback - I only just noticed that the stop_token implementation had been published.
The implementation looks like it's doing all the right things from a correctness point of view, which is good!
I have a few suggestions for some changes that might help reduce the number of atomic ops needed for some operations which you could consider if performance was problematic.
libcxx/include/__stop_token/stop_source.h | ||
---|---|---|
65 | The way that you have separated the ref-count from the stop_source_counter in the state_t means that destroying a stop_source object now needs to update two separate atomic variables, resulting in two atomic operations instead of one. | |
libcxx/include/__stop_token/stop_state.h | ||
59 | Initialising the ref-count to zero means that you need to do an atomic increment when the object is first constructed. Since you know that a state is only ever allocated by a stop_source, which will immediately take a reference, you could just initialise the reference to 1 here and have the stop_source constructor's intrusive_shared_ptr "adopt" the reference. Further, if you merged the ref-count and state_t atomics into a single 64-bit atomic then you could reduce the destruction of a stop_source to a single atomic subtraction. Making both of these changes would allow reducing the number of atomics executed when constructing a stop_source and immediately destructing it from 4 atomic ops in the current implementation, to just 1 atomic operation. | |
133 | Note that you don't need to lock again here if you know that this callback was the last callback in the list, since once you have set the "stop-requested" flag, then no more callback items should be enqueued. This should save you an additional atomic lock/unlock at the cost of a branch here. | |
160 | If you were to combine the state_t and the ref-count you could potentially merge the ref-count increment with the unlock into a single atomic operation by adding (__ref_count_increment - _lock_bit) atomically (in the case where the stop_callback is being constructed with a const stop_token&). This would save an extra atomic op. Ideally the compiler would be able to merge consecutive atomic ops like this, but I don't think I've seen clang do such a thing before. |
Since we want to investigate some improvements, I would start by making stop_token experimental. Then we can add some benchmarks that faithfully represent the intended use cases for stop_token, not just some random benchmarks that test the performance of some arbitrary operations. @lewissbaker are you aware of any benchmarks that would be representative of the way stop_token intends to be used?
Then we can try a few things (in no order):
- Try to remove some atomic operations as laid out by @lewissbaker above
- Try implementing the lock with try { mutex.lock(); } catch (...) { ... }; instead and see if it's faster. We need try-catch cause the methods on stop_token are noexcept.
- Try to see if there's something obviously wrong with our atomic_unique_lock implementation. Maybe we're using compare_exchange_weak when we should be using strong, or maybe there's some other perf issue with our implementation.
I would also not eliminate the possibility that the current implementation actually performs quite well for the intended use cases of stop_token. But the benchmarks will give us more insight into that.
I wrote up a few basic tests/benchmarks for an older implementation of the design in cppcoro here:
https://github.com/lewissbaker/cppcoro/blob/master/test/cancellation_token_tests.cpp
Although I don't think these tests are necessarily representative of real-world usage.
There are a few use-cases that are going to be common for stop_tokens which could be good candidates for benchmarks:
- The std::jthread use-case.
In this case, you will have a single thread created by std::jthread consuming the stop_token:
- registering/deregistering callbacks, usually just one at a time (e.g. in implementation of condition_variable::wait functions that take a stop-token), but sometimes might register one or more long-lived callbacks combined with single ephemeral callbacks.
- polling for a stop_request / stop_possible
- copying the stop_token (e.g. when passing by-value to nested function calls)
And then a main thread will eventually call request_stop() from the destructor of the std::jthread.
A rough sketch:
int main() { std::uint64_t count = 0; { std::jthread t([&](std::stop_token st) { std::uint64_t local_count = 0; while (!st.stop_requested()) { std::stop_callback cb{st, [&]() noexcept {}}; ++local_count; } count = local_count; }); std::this_thread::sleep_for(30s); } std::cout << "Thread did " << count << " callback registration/deregistration in 30s\n"; }
If you also have an implementation of the condition_variable::wait() functions then you could also try calling the stop_token-taking versions of those instead.
- Async use-case #1 - Shutdown stop_token
At app startup, the app creates a single stop_source which it will then pass an associated stop_token to every request.
Assume a thread-pool handles these requests and for each request it polls for stop_requested(), then attaches a stop-callback, does some work, then detaches the stop-callback some time later.
The lifetime of requests/callbacks would overlap with other requests/callback from the same thread.
Say something like each thread keeping a circular buffer of N stop-callbacks and destroying the stop-callbacks in FIFO order
Rough sketch:
constexpr size_t thread_count = 20; constexpr size_t concurrent_request_count = 1000; std::atomic<bool> ready{false}; struct dummy_stop_callback { void operator()() const noexcept {} }; void thread_func(uint64_t& count, std::stop_token st) { std::vector<std::optional<std::stop_callback<dummy_stop_callback>>> cbs(concurrent_request_count); ready.wait(false); std::uint32_t index = 0; std::uint64_t local_count = 0; while (!st.stop_requested()) { cbs[index].emplace(st, dummy_stop_callback{}); index = (index + 1) % concurrent_request_count; ++local_count; } count = local_count; } int main() { std::vector<std::uint64_t> counts(thread_count, 0); std::stop_source ss; { std::vector<std::jthread> threads; { auto release_on_exit = on_scope_exit([&]() { ready = true; ready.notify_all(); }); for (size_t i = 0; i < thread_count; ++i) { threads.emplace_back(thread_func, counts[i], ss.get_token()); } } std::this_thread::sleep_for(std::chrono::seconds(10)); ss.request_stop(); } std::uint64_t total_count = std::reduce(counts.begin(), counts.end()); std::cout << "Total iterations of " << thread_count << " threads for 10s was " << total_count << "\n"; }
Variations on this might involve deregistering stop-callbacks on different threads than the threads they were registered on.
But this is more involved to setup.
- Async use-case #2 - Per-request stop_source
This is a variation on the previous use-case, but where each task ends up creating a bunch of different chained stop_sources to simulate the concurrent sub-task nature of some operations.
e.g. for each request, we create a new stop_source, A, and then attach a stop_callback to the global shutdown stop-token which then calls request_stop() on the new stop_source if called.
Then we simulate passing the a stop_token to two child operations, each of which then creates its own stop_source (B and C) and subscribes a stop_callback to A which calls request_stop() on B or C respectively.
This can be done to an arbitrary number of ops per request, with varying tree widths/depths.
This should help to evaluate the performance of creation/destruction of stop_source and non-contended registration of stop-callbacks with less of a focus on lots of threads hitting the same stop-state.
You could also try a variation on this without the top-level shutdown stop_callback.
It is worth noting, however, that with P2300 there is also a proposed std::in_place_stop_source, which provides an alternative that does not heap-allocate/ref-count but for which the caller is required to keep the stop-source alive until all child ops that use the stop-tokens are done with it, and so I'd expect a lot of the structured async use-cases to eventually use that where possible instead of std::stop_source. This in-place version was not appropriate for jthread since that needs to be movable.
Thanks a lot, this is great! @huixie90 I think this is something we can work with -- let's make this experimental, add a few benchmarks like laid out above and then iterate on that!
I created https://github.com/llvm/llvm-project/issues/63738 to track these optimizations. Let's close this patch since it has been shipped in b77e50e6aef5650d1ce0ce7ad1a41049395b9426
After this patch it's partial :-) Can you document which parts are still missing?
We have multiple partial done papers, without any information what is and what is not done. That makes it hard to finish these papers.