This is an archive of the discontinued LLVM Phabricator instance.

[libc++] Do not yield from __sp_mut::lock()
ClosedPublic

Authored by ldionne on Jun 2 2022, 7:42 AM.

Details

Reviewers
None
Group Reviewers
Restricted Project
Commits
rG5b386ac912ea: [libc++] Do not yield from __sp_mut::lock()
Summary

Instead of trying to be clever and design our own locking primitive,
simply rely on the OS-provided implementation to do the right thing.

Indeed, manually yielding to the OS does not provide the necessary
information for it to make good prioritization decisions. For example,
if a higher-priority thread yields while waiting for a lock held by a
lower-priority thread and the system is contended, the lower-priority
thread may not get to run until the higher-priority thread has yielded
16 times and finally falls back to __libcpp_mutex_lock().
Once that happens, the OS can bump the priority of the thread that
currently holds the lock to unblock everyone. So instead, we might as
well give the system all the information from the start so it can make
appropriate decisions.
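
To make the difference concrete, here is a rough sketch of the two
strategies, written against std::mutex instead of libc++'s internal
__libcpp_mutex_t (the exact names and layout in src/memory.cpp differ);
it illustrates the pattern being removed, not the verbatim diff:

#include <mutex>
#include <thread>

// Old strategy (sketch): spin on try_lock(), yield to the scheduler on
// every failed attempt, and only block in lock() after 16 failures.
void lock_with_yield(std::mutex& m) {
    unsigned count = 0;
    while (!m.try_lock()) {
        if (++count > 16) {
            m.lock(); // only now does the OS learn that we are blocked
            return;
        }
        std::this_thread::yield();
    }
}

// New strategy (sketch): block immediately, so the OS knows from the
// start which thread is waiting on which lock and can prioritize
// accordingly.
void lock_directly(std::mutex& m) {
    m.lock();
}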

As a fly-by change, also double the number of locks in the table from
16 to 32. The size increase is modest, but it has the potential to
halve the amount of contention on those locks.

rdar://93598606
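
For context, the lock table mentioned above maps each object's address
to one of a fixed number of mutexes. A simplified, self-contained
sketch of that pattern (kNumLocks, lock_table, and lock_for are made-up
names for illustration; the real table in src/memory.cpp uses libc++'s
internal __sp_mut type):

#include <cstddef>
#include <functional>
#include <mutex>

// Atomic shared_ptr operations on an object are serialized by whichever
// mutex the object's address hashes to. With a power-of-two table size
// the index is a cheap mask, and doubling the table size roughly halves
// the chance that two unrelated objects contend on the same mutex.
constexpr std::size_t kNumLocks = 32; // was 16 before this change
std::mutex lock_table[kNumLocks];

std::mutex& lock_for(const void* p) {
    return lock_table[std::hash<const void*>()(p) & (kNumLocks - 1)];
}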

Diff Detail

Event Timeline

ldionne created this revision. Jun 2 2022, 7:42 AM
Herald added a project: Restricted Project. Jun 2 2022, 7:42 AM
ldionne requested review of this revision. Jun 2 2022, 7:42 AM
Herald added a reviewer: Restricted Project.
ldionne added a subscriber: Restricted Project. Jun 2 2022, 7:43 AM

This patch was put together in collaboration with a performance engineer on our side. It was tested with a use case that was previously problematic and this patch did solve the issue. We believe this should be an improvement for all OSes if their pthread_mutex_lock is reasonable (which is the case on Linux and Apple at least), but it would definitely be interesting to have other folks' eyes on this to get other opinions. Adding libc++ vendors for awareness.

Mordante added inline comments.
libcxx/src/memory.cpp
141

Since you're changing this value, it might be a good moment to document why 32 is "the right" value.

ldionne added inline comments. Jun 3 2022, 10:55 AM
libcxx/src/memory.cpp
141

Honestly, I just decided to double it to decrease contention a bit, but it wasn't done in any particularly scientific way.

Mordante added inline comments. Jun 3 2022, 11:35 AM
libcxx/src/memory.cpp
141

Fair point. Since you looked at it with a performance engineer, I hoped there was some additional information behind this change.
If there isn't, that's fine by me too.

dim added a subscriber: dim. Jun 4 2022, 7:41 AM

This patch was put together in collaboration with a performance engineer on our side. It was tested with a use case that was previously problematic and this patch did solve the issue. We believe this should be an improvement for all OSes if their pthread_mutex_lock is reasonable (which is the case on Linux and Apple at least)

What would you consider "reasonable"? Or maybe I should ask the question differently: what was the original reason for adding the while loop in the first place?

dim added a comment. Jun 4 2022, 7:45 AM

To answer my own question partially, the original commit is Howard's: rG088e37c77aafaec5ead8fbe7ebf918265e6b86f2:

Despite my pathological distrust of spin locks, the numbers just don't lie. I've put a small spin in __sp_mut::lock() on std::mutex::try_lock(), which is testing quite well. In my experience, putting in a yield for every failed iteration is also a major performance booster. This change makes one of the performance tests I was using (a highly contended one) run about 20 times faster.

But this was on macOS at the time, so ~2012...

To answer my own question partially, the original commit is Howard's: rG088e37c77aafaec5ead8fbe7ebf918265e6b86f2:

Despite my pathological distrust of spin locks, the numbers just don't lie. I've put a small spin in __sp_mut::lock() on std::mutex::try_lock(), which is testing quite well. In my experience, putting in a yield for every failed iteration is also a major performance booster. This change makes one of the performance tests I was using (a highly contended one) run about 20 times faster.

But this was on macOS at the time, so ~2012...

Right, and macOS's locking primitives have improved a lot since then. According to our perf engineers, the current pattern of yielding is actually actively harmful for performance on our platform, and apparently on Linux too.

Or maybe I should ask the question differently: what was the original reason for adding the while loop in the first place?

One way to see this is that Howard originally added this complexity layer on top of the OS's mutex because experimentation showed it to be better. However, as OSes improve their threading primitives, it becomes better to give as much information to the OS as possible, which is what this patch does.

libcxx/src/memory.cpp
141

The reasoning is basically that by doubling the number of mutexes, we should roughly halve the amount of contention: with a reasonably uniform hash, two unrelated objects land on the same mutex with probability 1/N, so going from 16 to 32 cuts the expected collisions roughly in half. Since there's virtually no size increase from doubling the number of mutexes, it's an easy tradeoff to make.

With 16 mutexes:

$ bloaty build/default/lib/libc++.dylib
    FILE SIZE        VM SIZE
 --------------  --------------
  37.3%   906Ki  37.0%   906Ki    String Table
  30.2%   733Ki  29.9%   733Ki    __TEXT,__text
  19.6%   476Ki  19.5%   476Ki    Symbol Table
   5.2%   125Ki   5.1%   125Ki    __TEXT,__const
   3.2%  77.9Ki   3.3%  80.0Ki    [__LINKEDIT]
   0.0%       0   1.0%  23.3Ki    __DATA,__bss
   0.8%  18.7Ki   0.8%  18.7Ki    [__TEXT]
   0.8%  18.5Ki   0.8%  18.5Ki    __DATA_CONST,__const
   0.7%  16.7Ki   0.7%  16.7Ki    __TEXT,__gcc_except_tab
   0.6%  14.3Ki   0.2%  5.52Ki    [__DATA]
   0.5%  11.9Ki   0.5%  11.9Ki    [__DATA_CONST]
   0.4%  10.2Ki   0.4%  10.2Ki    __TEXT,__unwind_info
   0.3%  6.58Ki   0.3%  6.58Ki    Function Start Addresses
   0.2%  4.93Ki   0.2%  4.93Ki    __TEXT,__cstring
   0.1%  1.71Ki   0.1%  1.71Ki    __DATA,__data
   0.1%  1.70Ki   0.1%  1.66Ki    [Mach-O Headers]
   0.1%  1.59Ki   0.1%  1.59Ki    __DATA_CONST,__got
   0.0%       0   0.1%  1.45Ki    __DATA,__common
   0.1%  1.41Ki   0.1%  1.41Ki    Indirect Symbol Table
   0.0%     954   0.0%     954    __TEXT,__stubs
   0.0%     496   0.0%     496    [2 Others]
 100.0%  2.37Mi 100.0%  2.39Mi    TOTAL

$ bloaty build/default/lib/libc++.a
    FILE SIZE        VM SIZE
 --------------  --------------
 100.0%  7.43Mi   NAN%       0    [AR Non-ELF Member File]
   0.0%  2.82Ki   NAN%       0    [AR Headers]
 100.0%  7.44Mi 100.0%       0    TOTAL

With 32 mutexes:

$ bloaty build/default/lib/libc++.dylib
    FILE SIZE        VM SIZE
 --------------  --------------
  37.3%   906Ki  37.0%   906Ki    String Table
  30.2%   733Ki  29.9%   733Ki    __TEXT,__text
  19.6%   476Ki  19.5%   476Ki    Symbol Table
   5.2%   125Ki   5.1%   125Ki    __TEXT,__const
   3.2%  77.9Ki   3.3%  80.2Ki    [__LINKEDIT]
   0.0%       0   1.0%  23.3Ki    __DATA,__bss
   0.8%  18.8Ki   0.8%  18.8Ki    [__TEXT]
   0.8%  18.5Ki   0.8%  18.5Ki    __DATA_CONST,__const
   0.7%  16.7Ki   0.7%  16.7Ki    __TEXT,__gcc_except_tab
   0.5%  13.2Ki   0.2%  4.39Ki    [__DATA]
   0.5%  11.9Ki   0.5%  11.9Ki    [__DATA_CONST]
   0.4%  10.2Ki   0.4%  10.2Ki    __TEXT,__unwind_info
   0.3%  6.57Ki   0.3%  6.57Ki    Function Start Addresses
   0.2%  4.93Ki   0.2%  4.93Ki    __TEXT,__cstring
   0.1%  2.84Ki   0.1%  2.84Ki    __DATA,__data
   0.1%  1.70Ki   0.1%  1.66Ki    [Mach-O Headers]
   0.1%  1.59Ki   0.1%  1.59Ki    __DATA_CONST,__got
   0.0%       0   0.1%  1.45Ki    __DATA,__common
   0.1%  1.41Ki   0.1%  1.41Ki    Indirect Symbol Table
   0.0%     954   0.0%     954    __TEXT,__stubs
   0.0%     496   0.0%     496    [2 Others]
 100.0%  2.37Mi 100.0%  2.39Mi    TOTAL

$ bloaty build/default/lib/libc++.a
    FILE SIZE        VM SIZE
 --------------  --------------
 100.0%  7.43Mi   NAN%       0    [AR Non-ELF Member File]
   0.0%  2.82Ki   NAN%       0    [AR Headers]
 100.0%  7.44Mi 100.0%       0    TOTAL
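
As a rough sanity check on those numbers, the __DATA,__data growth
above (1.71Ki -> 2.84Ki, about 1.1 KiB) is in the ballpark of what 16
extra mutexes cost, assuming the table is backed by pthread mutexes as
it is with the pthread threading API (sizes are platform-dependent;
pthread_mutex_t is 64 bytes on Darwin):

#include <cstdio>
#include <pthread.h>

int main() {
    // 32 - 16 = 16 additional mutexes in the static table. On Darwin this
    // is 16 * 64 = 1024 bytes, in the same ballpark as the __DATA,__data
    // delta reported by bloaty above.
    std::printf("extra bytes: %zu\n", 16 * sizeof(pthread_mutex_t));
}
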
ldionne accepted this revision as: Restricted Project. Jun 13 2022, 8:48 AM

I'll ship this now based on the analysis that we did -- if someone has comments/objections/suggestions, please let me know and we can change things post-commit.

This revision was not accepted when it landed; it landed in state Needs Review. Jun 13 2022, 8:48 AM
This revision was landed with ongoing or failed builds.
This revision was automatically updated to reflect the committed changes.