This is an archive of the discontinued LLVM Phabricator instance.

Differential D154702

[libc++] [DO NOT MERGE] benchmark stop_token and use std::mutex in the implementation of stop_token
Abandoned · Public

Authored by ldionne on Jul 7 2023, 4:05 AM.

Details

Reviewers

EricWF
lewissbaker
huixie90

Group Reviewers

Restricted Project

Summary

[libc++] use mutex in stop_token

These are the results from the benchmarks provided by @lewissbaker.

Higher numbers are better.

Baseline (using std::atomic)

Test1: jthread

Thread did 92610720 callback registration/deregistration in 30s
Thread did 94441617 callback registration/deregistration in 30s
Thread did 95045828 callback registration/deregistration in 30s
Thread did 92417235 callback registration/deregistration in 30s
Thread did 93926989 callback registration/deregistration in 30s
Thread did 94425982 callback registration/deregistration in 30s
Thread did 91251163 callback registration/deregistration in 30s
Thread did 92966481 callback registration/deregistration in 30s
Thread did 92799169 callback registration/deregistration in 30s
Thread did 92522792 callback registration/deregistration in 30s

Test2: async shutdown

Total iterations of 20 threads for 10s was 17164430
Total iterations of 20 threads for 10s was 15261966
Total iterations of 20 threads for 10s was 14647555
Total iterations of 20 threads for 10s was 14204384
Total iterations of 20 threads for 10s was 13803872
Total iterations of 20 threads for 10s was 13950054
Total iterations of 20 threads for 10s was 13941287
Total iterations of 20 threads for 10s was 14106324
Total iterations of 20 threads for 10s was 13442434
Total iterations of 20 threads for 10s was 13770722

Using std::mutex

Test1: jthread

Thread did 176115775 callback registration/deregistration in 30s
Thread did 175788755 callback registration/deregistration in 30s
Thread did 175913759 callback registration/deregistration in 30s
Thread did 175611145 callback registration/deregistration in 30s
Thread did 175465895 callback registration/deregistration in 30s
Thread did 176001367 callback registration/deregistration in 30s
Thread did 176113327 callback registration/deregistration in 30s
Thread did 175989687 callback registration/deregistration in 30s
Thread did 175891133 callback registration/deregistration in 30s
Thread did 174903412 callback registration/deregistration in 30s

Test2: async shutdown

Total iterations of 20 threads for 10s was 13181221
Total iterations of 20 threads for 10s was 11502741
Total iterations of 20 threads for 10s was 10966349
Total iterations of 20 threads for 10s was 10615504
Total iterations of 20 threads for 10s was 10795518
Total iterations of 20 threads for 10s was 10297964
Total iterations of 20 threads for 10s was 10800617
Total iterations of 20 threads for 10s was 10540040
Total iterations of 20 threads for 10s was 10687574
Total iterations of 20 threads for 10s was 11015713

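We can see that using std::mutex in the implementation gives about an 80-90% speedup in Test1, but a slowdown in Test2: mean throughput drops by roughly 23% (equivalently, the std::atomic baseline is about 30% faster).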
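As a cross-check on the percentages quoted above, the ten samples per configuration can be averaged and compared. The sketch below is not part of the patch: `mean` and `relative_change_percent` are hypothetical helper names, and the sample vectors are copied from the results in this summary.

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Arithmetic mean of a set of benchmark samples (hypothetical helper).
double mean(const std::vector<std::uint64_t>& samples) {
  assert(!samples.empty());
  const std::uint64_t sum =
      std::accumulate(samples.begin(), samples.end(), std::uint64_t{0});
  return static_cast<double>(sum) / static_cast<double>(samples.size());
}

// Relative change of `candidate` against `baseline`, in percent.
// Positive means more iterations (faster), negative means fewer (slower).
double relative_change_percent(double baseline, double candidate) {
  return (candidate - baseline) / baseline * 100.0;
}

// Test1 (jthread) samples copied from the summary.
const std::vector<std::uint64_t> test1_atomic = {
    92610720, 94441617, 95045828, 92417235, 93926989,
    94425982, 91251163, 92966481, 92799169, 92522792};
const std::vector<std::uint64_t> test1_mutex = {
    176115775, 175788755, 175913759, 175611145, 175465895,
    176001367, 176113327, 175989687, 175891133, 174903412};

// Test2 (async shutdown) samples copied from the summary.
const std::vector<std::uint64_t> test2_atomic = {
    17164430, 15261966, 14647555, 14204384, 13803872,
    13950054, 13941287, 14106324, 13442434, 13770722};
const std::vector<std::uint64_t> test2_mutex = {
    13181221, 11502741, 10966349, 10615504, 10795518,
    10297964, 10800617, 10540040, 10687574, 11015713};
```

Calling `relative_change_percent(mean(test1_atomic), mean(test1_mutex))` gives roughly +88% for Test1, and the same call on the Test2 vectors gives roughly -23% (put the other way round, the baseline mean is about 31% above the mutex mean).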

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

huixie90 created this revision. (Jul 7 2023, 4:05 AM)

Herald added a project: Restricted Project. (Jul 7 2023, 4:05 AM)

huixie90 requested review of this revision. (Jul 7 2023, 4:05 AM)

Herald added a reviewer: Restricted Project.

Herald added a subscriber: libcxx-commits.

Harbormaster completed remote builds in B243729: Diff 538070. (Jul 7 2023, 4:06 AM)

huixie90 edited the summary of this revision. (Jul 7 2023, 4:12 AM)

huixie90 added reviewers: EricWF, lewissbaker, ldionne.

huixie90 added a subscriber: lewissbaker.

This is actually quite nice! The second test (libcxx/benchmarks/stop_token.async_shutdown.bench.cpp) is the one that's more realistic (by far), so IMO it's the one that we should try to optimize for.

However, here are the timings I got on my arm64 Mac Studio:

With std::atomic:
stop_token.async_shutdown.pass.cpp: Total iterations of 20 threads for 10s was 3 100 911
stop_token.jthread.pass.cpp: Thread did 1 434 114 717 callback registration/deregistration in 30s

With std::mutex:
stop_token.async_shutdown.pass.cpp: Total iterations of 20 threads for 10s was 228 318 798
stop_token.jthread.pass.cpp: Thread did 1 618 248 541 callback registration/deregistration in 30s

This is rather bad. I think there's probably a significant problem with the implementation of our atomic notify functions. I think we need to figure out that bug before we can draw any conclusions about stop_token, since the current state is just bonkers.

Inline comments on libcxx/benchmarks/stop_token.async_shutdown.bench.cpp:
Line 1: We should modify these benchmarks to instead use GoogleBenchmark and time how long it takes to do N registrations.
Lines 56–59: suggested edit (shown with the file contents).

I think we can abandon this since this is now https://github.com/llvm/llvm-project/pull/69117.

GitHub <noreply@github.com> mentioned this in rG511236e07436: [libc++][test] Add `stop_token` benchmark (#69117). (Oct 16 2023, 1:49 PM)

[GH PR Transition] Commandeering to abandon.

ldionne abandoned this revision. (Nov 3 2023, 8:27 AM)

Revision Contents

Path                                                     Size
libcxx/benchmarks/CMakeLists.txt                         2 lines
libcxx/benchmarks/stop_token.async_shutdown.bench.cpp    73 lines
libcxx/benchmarks/stop_token.jthread.bench.cpp           29 lines
libcxx/include/__stop_token/stop_state.h                 106 lines

Diff 538070

libcxx/benchmarks/CMakeLists.txt

[183 lines not shown; context: set(BENCHMARK_TESTS]
     formatter_float.bench.cpp
     formatter_int.bench.cpp
     function.bench.cpp
     join_view.bench.cpp
     lexicographical_compare_three_way.bench.cpp
     map.bench.cpp
     monotonic_buffer.bench.cpp
     ordered_set.bench.cpp
+    stop_token.async_shutdown.bench.cpp
+    stop_token.jthread.bench.cpp
     std_format_spec_string_unicode.bench.cpp
     string.bench.cpp
     stringstream.bench.cpp
     to_chars.bench.cpp
     unordered_set_operations.bench.cpp
     util_smartptr.bench.cpp
     variant_visit_1.bench.cpp
     variant_visit_2.bench.cpp
[34 lines not shown]

libcxx/benchmarks/stop_token.async_shutdown.bench.cpp

This file was added.

//===----------------------------------------------------------------------===//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//===----------------------------------------------------------------------===//

#include <chrono>
#include <iostream>
#include <numeric>
#include <stop_token>
#include <thread>

using namespace std::chrono_literals;

constexpr size_t thread_count = 20;
constexpr size_t concurrent_request_count = 1000;
std::atomic<bool> ready{false};

struct dummy_stop_callback {
  void operator()() const noexcept {}
};

void thread_func(uint64_t* count, std::stop_token st) {
  std::vector<std::optional<std::stop_callback<dummy_stop_callback>>> cbs(concurrent_request_count);

  ready.wait(false);

  std::uint32_t index = 0;
  std::uint64_t local_count = 0;
  while (!st.stop_requested()) {
    cbs[index].emplace(st, dummy_stop_callback{});
    index = (index + 1) % concurrent_request_count;
    ++local_count;
  }
  *count = local_count;
}

template <class F>
struct on_scope_exit {
  on_scope_exit(F f) : f_(std::move(f)) {}
  on_scope_exit(const on_scope_exit&) = delete;
  on_scope_exit(on_scope_exit&&) = delete;
  F f_;
  ~on_scope_exit() { f_(); }
};

int main() {
  std::vector<std::uint64_t> counts(thread_count, 0);

  std::stop_source ss;
  {
    std::vector<std::jthread> threads;
    {
      auto release_on_exit = on_scope_exit([&]() {
        ready = true;
        ready.notify_all();
      });
      for (size_t i = 0; i < thread_count; ++i) {
        threads.emplace_back(thread_func, &counts[i], ss.get_token());
      }
    }

    std::this_thread::sleep_for(std::chrono::seconds(10));

    ss.request_stop();
  }

  std::uint64_t total_count = std::reduce(counts.begin(), counts.end());

  std::cout << "Total iterations of " << thread_count << " threads for 10s was " << total_count << "\n";
}

Inline comments by ldionne (Author; Unsubmitted; Not Done):

Line 1: We should modify these benchmarks to instead use GoogleBenchmark and time how long it takes to do N registrations.

Lines 56–59, suggested edit:

     std::vector<std::jthread> threads;
     {
+      // Make sure we send the ready signal to existing threads even if the creation of some thread throws an exception.
+      // Otherwise, the main thread will wait forever for the jthreads created thus far to finish.
       auto release_on_exit = on_scope_exit([&]() {
         ready = true;
         ready.notify_all();
       });
       for (size_t i = 0; i < thread_count; ++i) {
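The reasoning in the suggested comment (the ready signal must be sent even if creating a thread throws) can be shown in isolation. The following is a minimal sketch, not code from the patch: it reuses the shape of `on_scope_exit`, but `ready_is_signalled` is a hypothetical function and a thrown `std::runtime_error` stands in for a failing thread creation.

```cpp
#include <cassert>
#include <stdexcept>
#include <utility>

// Same shape as on_scope_exit in the benchmark: run a callable on destruction.
template <class F>
struct on_scope_exit {
  explicit on_scope_exit(F f) : f_(std::move(f)) {}
  on_scope_exit(const on_scope_exit&) = delete;
  on_scope_exit(on_scope_exit&&) = delete;
  ~on_scope_exit() { f_(); }
  F f_;
};

// Hypothetical stand-in for the thread-creation block.
// Returns whether the "ready" flag was set by the time the scope was left.
bool ready_is_signalled(bool creation_throws) {
  bool ready = false;
  try {
    on_scope_exit guard([&] { ready = true; });
    if (creation_throws) {
      throw std::runtime_error("thread creation failed");
    }
  } catch (const std::runtime_error&) {
    // The guard has already run during stack unwinding.
  }
  return ready;
}
```

Both `ready_is_signalled(false)` and `ready_is_signalled(true)` return true, which is the property the benchmark relies on: any worker already blocked in `ready.wait(false)` is released whether or not the loop completes.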

libcxx/benchmarks/stop_token.jthread.bench.cpp

This file was added.

//===----------------------------------------------------------------------===//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//

#include <chrono>
#include <iostream>
#include <stop_token>
#include <thread>

using namespace std::chrono_literals;

int main() {
  std::uint64_t count = 0;
  {
    std::jthread t([&](std::stop_token st) {
      std::uint64_t local_count = 0;
      while (!st.stop_requested()) {
        std::stop_callback cb{st, [&]() noexcept {}};
        ++local_count;
      }
      count = local_count;
    });
    std::this_thread::sleep_for(30s);
  }
  std::cout << "Thread did " << count << " callback registration/deregistration in 30s\n";
}

libcxx/include/__stop_token/stop_state.h

 // -*- C++ -*-
 //===----------------------------------------------------------------------===//
 //
 // Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
 // See https://llvm.org/LICENSE.txt for license information.
 // SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
 //
 //===----------------------------------------------------------------------===//

 #ifndef _LIBCPP___STOP_TOKEN_STOP_STATE_H
 #define _LIBCPP___STOP_TOKEN_STOP_STATE_H

 #include <__availability>
 #include <__config>
-#include <__stop_token/atomic_unique_lock.h>
 #include <__stop_token/intrusive_list_view.h>
 #include <__thread/this_thread.h>
 #include <__thread/thread.h>
 #include <atomic>
 #include <cstdint>

 #if !defined(_LIBCPP_HAS_NO_PRAGMA_SYSTEM_HEADER)
 # pragma GCC system_header
[9 lines not shown; context: struct __stop_callback_base : __intrusive_node_base<__stop_callback_base> {]

   _LIBCPP_HIDE_FROM_ABI void __invoke() noexcept { __callback_fn_(this); }

   __callback_fn_t* __callback_fn_;
   atomic<bool> __completed_ = false;
   bool* __destroyed_ = nullptr;
 };

+// stop_token needs to lock with noexcept. mutex::lock can throw.
+// wrap it with a while loop and catch all exceptions
+class __nothrow_mutex_lock {
+  std::mutex& __mutex_;
+  bool __is_locked_;
+
+public:
+  _LIBCPP_HIDE_FROM_ABI explicit __nothrow_mutex_lock(std::mutex& __mutex) noexcept
+      : __mutex_(__mutex), __is_locked_(true) {
+    __lock();
+  }
+
+  __nothrow_mutex_lock(const __nothrow_mutex_lock&) = delete;
+  __nothrow_mutex_lock(__nothrow_mutex_lock&&) = delete;
+  __nothrow_mutex_lock& operator=(const __nothrow_mutex_lock&) = delete;
+  __nothrow_mutex_lock& operator=(__nothrow_mutex_lock&&) = delete;
+
+  _LIBCPP_HIDE_FROM_ABI ~__nothrow_mutex_lock() {
+    if (__is_locked_) {
+      __unlock();
+    }
+  }
+
+  _LIBCPP_HIDE_FROM_ABI bool __owns_lock() const noexcept { return __is_locked_; }
+
+  _LIBCPP_HIDE_FROM_ABI void __lock() noexcept {
+    while (true) {
+      try {
+        __mutex_.lock();
+        break;
+      } catch (...) {
+      }
+    }
+    __is_locked_ = true;
+  }
+
+  _LIBCPP_HIDE_FROM_ABI void __unlock() noexcept {
+    __mutex_.unlock(); // throws nothing
+    __is_locked_ = false;
+  }
+};
+
 class __stop_state {
   static constexpr uint32_t __stop_requested_bit = 1;
-  static constexpr uint32_t __callback_list_locked_bit = 1 << 1;
-  static constexpr uint32_t __stop_source_counter_shift = 2;
+  static constexpr uint32_t __stop_source_counter_shift = 1;

   // The "stop_source counter" is not used for lifetime reference counting.
   // When the number of stop_source reaches 0, the remaining stop_tokens's
   // stop_possible will return false. We need this counter to track this.
   //
   // The "callback list locked" bit implements the atomic_unique_lock to
   // guard the operations on the callback list
   //
-  //       31 - 2          |  1                   |    0           |
-  //  stop_source counter  | callback list locked | stop_requested |
+  //       31 - 1          |    0           |
+  //  stop_source counter  | stop_requested |
   atomic<uint32_t> __state_ = 0;

   // Reference count for stop_token + stop_callback + stop_source
   // When the counter reaches zero, the state is destroyed
   // It is used by __intrusive_shared_ptr, but it is stored here for better layout
   atomic<uint32_t> __ref_count_ = 0;
+  std::mutex __mutex_;

   using __state_t = uint32_t;
-  using __callback_list_lock = __atomic_unique_lock<__state_t, __callback_list_locked_bit>;
+  using __callback_list_lock = __nothrow_mutex_lock;
   using __callback_list = __intrusive_list_view<__stop_callback_base>;

   __callback_list __callback_list_;
   thread::id __requesting_thread_;

 public:
   _LIBCPP_HIDE_FROM_ABI __stop_state() noexcept = default;

[25 lines not shown; context: _LIBCPP_HIDE_FROM_ABI bool __stop_possible_for_stop_token() const noexcept {]
     // [stoptoken.mem] false if "a stop request was not made and there are no associated stop_source objects"
     // Todo: Can this be std::memory_order_relaxed as the standard does not say anything except not to introduce data
     // race?
     __state_t __curent_state = __state_.load(std::memory_order_acquire);
     return ((__curent_state & __stop_requested_bit) != 0) || ((__curent_state >> __stop_source_counter_shift) != 0);
   }

   _LIBCPP_AVAILABILITY_SYNC _LIBCPP_HIDE_FROM_ABI bool __request_stop() noexcept {
-    auto __cb_list_lock = __try_lock_for_request_stop();
-    if (!__cb_list_lock.__owns_lock()) {
+    __callback_list_lock __cb_list_lock(__mutex_);
+    auto __old = __state_.fetch_or(__stop_requested_bit, std::memory_order_release);
+    if ((__old & __stop_requested_bit) == __stop_requested_bit) {
       return false;
     }

     __requesting_thread_ = this_thread::get_id();

     while (!__callback_list_.__empty()) {
       auto __cb = __callback_list_.__pop_front();

       // allow other callbacks to be removed while invoking the current callback
       __cb_list_lock.__unlock();

[17 lines not shown; context: while (!__callback_list_.__empty()) {]

       __cb_list_lock.__lock();
     }

     return true;
   }

   _LIBCPP_AVAILABILITY_SYNC _LIBCPP_HIDE_FROM_ABI bool __add_callback(__stop_callback_base* __cb) noexcept {
-    // If it is already stop_requested. Do not try to request it again.
-    const auto __give_up_trying_to_lock_condition = [__cb](__state_t __state) {
+    __callback_list_lock __cb_list_lock(__mutex_);
+    auto __state = __state_.load(std::memory_order_acquire);
     if ((__state & __stop_requested_bit) != 0) {
       // already stop requested, synchronously run the callback and no need to lock the list again
       __cb->__invoke();
-      return true;
+      return false;
     }
-      // no stop source. no need to lock the list to add the callback as it can never be invoked
-      return (__state >> __stop_source_counter_shift) == 0;
-    };

-    __callback_list_lock __cb_list_lock(__state_, __give_up_trying_to_lock_condition);
-    if (!__cb_list_lock.__owns_lock()) {
+    // no stop source. no need to lock the list to add the callback as it can never be invoked
+    if ((__state >> __stop_source_counter_shift) == 0) {
       return false;
     }

     __callback_list_.__push_front(__cb);

     return true;
     // unlock here: [thread.stoptoken.intro] Registration of a callback synchronizes with the invocation of
     // that callback.
     // Note: this release sync with the acquire in the request_stop' __try_lock_for_request_stop
   }

   // called by the destructor of stop_callback
   _LIBCPP_AVAILABILITY_SYNC _LIBCPP_HIDE_FROM_ABI void __remove_callback(__stop_callback_base* __cb) noexcept {
-    __callback_list_lock __cb_list_lock(__state_);
+    __callback_list_lock __cb_list_lock(__mutex_);

     // under below condition, the request_stop call just popped __cb from the list and could execute it now
     bool __potentially_executing_now = __cb->__prev_ == nullptr && !__callback_list_.__is_head(__cb);

     if (__potentially_executing_now) {
       auto __requested_thread = __requesting_thread_;
       __cb_list_lock.__unlock();

[10 lines not shown; context: if (__potentially_executing_now) {]
         }
       }
     } else {
       __callback_list_.__remove(__cb);
     }
   }

 private:
-  _LIBCPP_AVAILABILITY_SYNC _LIBCPP_HIDE_FROM_ABI __callback_list_lock __try_lock_for_request_stop() noexcept {
-    // If it is already stop_requested, do not try to request stop or lock the list again.
-    const auto __lock_fail_condition = [](__state_t __state) { return (__state & __stop_requested_bit) != 0; };
-
-    // set locked and requested bit at the same time
-    const auto __after_lock_state = [](__state_t __state) {
-      return __state | __callback_list_locked_bit | __stop_requested_bit;
-    };
-
-    // acq because [thread.stoptoken.intro] Registration of a callback synchronizes with the invocation of that
-    // callback. We are going to invoke the callback after getting the lock, acquire so that we can see the
-    // registration of a callback (and other writes that happens-before the add_callback)
-    // Note: the rel (unlock) in the add_callback syncs with this acq
-    // rel because [thread.stoptoken.intro] A call to request_stop that returns true synchronizes with a call
-    // to stop_requested on an associated stop_token or stop_source object that returns true.
-    // We need to make sure that all writes (including user code) before request_stop will be made visible
-    // to the threads that waiting for `stop_requested == true`
-    // Note: this rel syncs with the acq in `stop_requested`
-    const auto __locked_ordering = std::memory_order_acq_rel;
-
-    return __callback_list_lock(__state_, __lock_fail_condition, __after_lock_state, __locked_ordering);
-  }
-
   template <class _Tp>
   friend struct __intrusive_shared_ptr_traits;
 };

 template <class _Tp>
 struct __intrusive_shared_ptr_traits;

 template <>
[11 lines not shown]
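The two sides of this diff differ mainly in how the callback list is locked: the old code packs a "callback list locked" bit into the atomic state word, while the new code adds a separate std::mutex. The sketch below illustrates the first idea in isolation. It is not libc++'s `__atomic_unique_lock`: `bit_lock`, `locked_bit`, and `count_with_bit_lock` are hypothetical names, and this version simply spins with `std::this_thread::yield()`, whereas the reviewer's remark about the atomic notify functions suggests the real implementation blocks using atomic wait/notify.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Bit 1 of the state word acts as the lock; bit 0 is left for other flags.
constexpr std::uint32_t locked_bit = 1u << 1;

// RAII lock over one bit of a shared atomic word (hypothetical sketch).
class bit_lock {
  std::atomic<std::uint32_t>& state_;

public:
  explicit bit_lock(std::atomic<std::uint32_t>& state) noexcept : state_(state) {
    std::uint32_t current = state_.load(std::memory_order_relaxed);
    for (;;) {
      if ((current & locked_bit) != 0) {
        // Someone else holds the lock: back off and re-read the word.
        std::this_thread::yield();
        current = state_.load(std::memory_order_relaxed);
        continue;
      }
      // Try to set the bit; acquire pairs with the release in the destructor.
      if (state_.compare_exchange_weak(current, current | locked_bit,
                                       std::memory_order_acquire,
                                       std::memory_order_relaxed)) {
        return;
      }
    }
  }

  bit_lock(const bit_lock&) = delete;
  bit_lock& operator=(const bit_lock&) = delete;

  // Clear only the lock bit so the other bits of the word are preserved.
  ~bit_lock() { state_.fetch_and(~locked_bit, std::memory_order_release); }
};

// Increment a plain counter from several threads under the bit lock.
std::uint64_t count_with_bit_lock(unsigned thread_count, unsigned iterations) {
  std::atomic<std::uint32_t> state{0};
  std::uint64_t counter = 0;
  std::vector<std::thread> threads;
  for (unsigned t = 0; t < thread_count; ++t) {
    threads.emplace_back([&] {
      for (unsigned i = 0; i < iterations; ++i) {
        bit_lock lock(state);
        ++counter;
      }
    });
  }
  for (auto& thread : threads) {
    thread.join();
  }
  return counter;
}
```

Sharing one word between the lock bit and the stop_requested bit is what allowed the old `__try_lock_for_request_stop` to take the lock and set stop_requested in a single atomic step; with a separate mutex, the new code needs the lock acquisition plus a `fetch_or`.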