This is an archive of the discontinued LLVM Phabricator instance.

[libc++] Do not yield from __sp_mut::lock()
ClosedPublic

Authored by ldionne on Jun 2 2022, 7:42 AM.

Details

Reviewers
None
Group Reviewers
Restricted Project
Commits
rG5b386ac912ea: [libc++] Do not yield from __sp_mut::lock()
Summary

Instead of trying to be clever and design our own locking primitive,
simply rely on the OS-provided implementation to do the right thing.

Indeed, manually yielding to the OS does not provide the necessary
information for it to make good prioritization decisions. For example,
if a higher-priority thread yields while waiting for a lock held by a
lower-priority thread and the system is contended, the lower-priority
thread may not get to run until the higher-priority thread has yielded
16 times and finally falls back to __libcpp_mutex_lock().
Once that happens, the OS can bump the priority of the thread that
currently holds the lock to unblock everyone. So instead, we might as
well give the system all the information from the start so it can make
appropriate decisions.
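
To make the difference concrete, here is a rough sketch of the two
strategies, written against std::mutex instead of libc++'s internal
__libcpp_mutex_t (the exact names and layout in src/memory.cpp differ);
it illustrates the pattern being removed, not the verbatim diff:

#include <mutex>
#include <thread>

// Old strategy (sketch): spin on try_lock(), yield to the scheduler on
// every failed attempt, and only block in lock() after 16 failures.
void lock_with_yield(std::mutex& m) {
    unsigned count = 0;
    while (!m.try_lock()) {
        if (++count > 16) {
            m.lock(); // only now does the OS learn that we are blocked
            return;
        }
        std::this_thread::yield();
    }
}

// New strategy (sketch): block immediately, so the OS knows from the
// start which thread is waiting on which lock and can prioritize
// accordingly.
void lock_directly(std::mutex& m) {
    m.lock();
}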

As a fly-by change, also double the number of locks in the table from
16 to 32. The size increase is modest, but it has the potential to
halve the amount of contention on those locks.

rdar://93598606
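
For context, the lock table mentioned above maps each object's address
to one of a fixed number of mutexes. A simplified, self-contained
sketch of that pattern (kNumLocks, lock_table, and lock_for are made-up
names for illustration; the real table in src/memory.cpp uses libc++'s
internal __sp_mut type):

#include <cstddef>
#include <functional>
#include <mutex>

// Atomic shared_ptr operations on an object are serialized by whichever
// mutex the object's address hashes to. With a power-of-two table size
// the index is a cheap mask, and doubling the table size roughly halves
// the chance that two unrelated objects contend on the same mutex.
constexpr std::size_t kNumLocks = 32; // was 16 before this change
std::mutex lock_table[kNumLocks];

std::mutex& lock_for(const void* p) {
    return lock_table[std::hash<const void*>()(p) & (kNumLocks - 1)];
}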

Diff Detail

Event Timeline

ldionne created this revision. Jun 2 2022, 7:42 AM
Herald added a project: Restricted Project. Jun 2 2022, 7:42 AM
ldionne requested review of this revision. Jun 2 2022, 7:42 AM
Herald added a reviewer: Restricted Project.
ldionne added a subscriber: Restricted Project. Jun 2 2022, 7:43 AM

This patch was put together in collaboration with a performance engineer on our side. It was tested with a use case that was previously problematic and this patch did solve the issue. We believe this should be an improvement for all OSes if their pthread_mutex_lock is reasonable (which is the case on Linux and Apple at least), but it would definitely be interesting to have other folks' eyes on this to get other opinions. Adding libc++ vendors for awareness.

Mordante added inline comments.
libcxx/src/memory.cpp
141

Since you're changing this value, it might be a good moment to document why 32 is "the right" value.

ldionne added inline comments. Jun 3 2022, 10:55 AM
libcxx/src/memory.cpp
141

Honestly, I just decided to double it to decrease contention a bit, but it wasn't done in any particularly scientific way.

Mordante added inline comments. Jun 3 2022, 11:35 AM
libcxx/src/memory.cpp
141

Fair point. Since you looked at it with a performance engineer, I hoped there was some additional information behind this change.
If there isn't, that's fine by me too.

dim added a subscriber: dim. Jun 4 2022, 7:41 AM

This patch was put together in collaboration with a performance engineer on our side. It was tested with a use case that was previously problematic and this patch did solve the issue. We believe this should be an improvement for all OSes if their pthread_mutex_lock is reasonable (which is the case on Linux and Apple at least)

What would you consider "reasonable"? Or maybe I should ask the question differently: what was the original reason for adding the while loop in the first place?

dim added a comment. Jun 4 2022, 7:45 AM

To answer my own question partially, the original commit is Howard's: rG088e37c77aafaec5ead8fbe7ebf918265e6b86f2:

Despite my pathological distrust of spin locks, the numbers just don't lie. I've put a small spin in __sp_mut::lock() on std::mutex::try_lock(), which is testing quite well. In my experience, putting in a yield for every failed iteration is also a major performance booster. This change makes one of the performance tests I was using (a highly contended one) run about 20 times faster.

But this was on macOS at the time, so ~2012...

To answer my own question partially, the original commit is Howard's: rG088e37c77aafaec5ead8fbe7ebf918265e6b86f2:

Despite my pathological distrust of spin locks, the numbers just don't lie. I've put a small spin in __sp_mut::lock() on std::mutex::try_lock(), which is testing quite well. In my experience, putting in a yield for every failed iteration is also a major performance booster. This change makes one of the performance tests I was using (a highly contended one) run about 20 times faster.

But this was on macOS at the time, so ~2012...

Right, and macOS's locking primitives have improved a lot since then. According to our perf engineers, the current pattern of yielding is actually actively harmful for performance on our platform, and apparently on Linux too.

Or maybe I should ask the question differently: what was the original reason for adding the while loop in the first place?

One way to see this is that Howard originally added this complexity layer on top of the OS's mutex because experimentation showed it to be better. However, as OSes improve their threading primitives, it becomes better to give as much information to the OS as possible, which is what this patch does.

libcxx/src/memory.cpp
141

The reasoning is basically that by doubling the number of mutexes, we should roughly halve the amount of contention: with a reasonably uniform hash, two unrelated objects land on the same mutex with probability 1/N, so going from 16 to 32 cuts the expected collisions roughly in half. Since there's virtually no size increase from doubling the number of mutexes, it's an easy tradeoff to make.

With 16 mutexes:

$ bloaty build/default/lib/libc++.dylib
    FILE SIZE        VM SIZE
 --------------  --------------
  37.3%   906Ki  37.0%   906Ki    String Table
  30.2%   733Ki  29.9%   733Ki    __TEXT,__text
  19.6%   476Ki  19.5%   476Ki    Symbol Table
   5.2%   125Ki   5.1%   125Ki    __TEXT,__const
   3.2%  77.9Ki   3.3%  80.0Ki    [__LINKEDIT]
   0.0%       0   1.0%  23.3Ki    __DATA,__bss
   0.8%  18.7Ki   0.8%  18.7Ki    [__TEXT]
   0.8%  18.5Ki   0.8%  18.5Ki    __DATA_CONST,__const
   0.7%  16.7Ki   0.7%  16.7Ki    __TEXT,__gcc_except_tab
   0.6%  14.3Ki   0.2%  5.52Ki    [__DATA]
   0.5%  11.9Ki   0.5%  11.9Ki    [__DATA_CONST]
   0.4%  10.2Ki   0.4%  10.2Ki    __TEXT,__unwind_info
   0.3%  6.58Ki   0.3%  6.58Ki    Function Start Addresses
   0.2%  4.93Ki   0.2%  4.93Ki    __TEXT,__cstring
   0.1%  1.71Ki   0.1%  1.71Ki    __DATA,__data
   0.1%  1.70Ki   0.1%  1.66Ki    [Mach-O Headers]
   0.1%  1.59Ki   0.1%  1.59Ki    __DATA_CONST,__got
   0.0%       0   0.1%  1.45Ki    __DATA,__common
   0.1%  1.41Ki   0.1%  1.41Ki    Indirect Symbol Table
   0.0%     954   0.0%     954    __TEXT,__stubs
   0.0%     496   0.0%     496    [2 Others]
 100.0%  2.37Mi 100.0%  2.39Mi    TOTAL

$ bloaty build/default/lib/libc++.a
    FILE SIZE        VM SIZE
 --------------  --------------
 100.0%  7.43Mi   NAN%       0    [AR Non-ELF Member File]
   0.0%  2.82Ki   NAN%       0    [AR Headers]
 100.0%  7.44Mi 100.0%       0    TOTAL

With 32 mutexes:

$ bloaty build/default/lib/libc++.dylib
    FILE SIZE        VM SIZE
 --------------  --------------
  37.3%   906Ki  37.0%   906Ki    String Table
  30.2%   733Ki  29.9%   733Ki    __TEXT,__text
  19.6%   476Ki  19.5%   476Ki    Symbol Table
   5.2%   125Ki   5.1%   125Ki    __TEXT,__const
   3.2%  77.9Ki   3.3%  80.2Ki    [__LINKEDIT]
   0.0%       0   1.0%  23.3Ki    __DATA,__bss
   0.8%  18.8Ki   0.8%  18.8Ki    [__TEXT]
   0.8%  18.5Ki   0.8%  18.5Ki    __DATA_CONST,__const
   0.7%  16.7Ki   0.7%  16.7Ki    __TEXT,__gcc_except_tab
   0.5%  13.2Ki   0.2%  4.39Ki    [__DATA]
   0.5%  11.9Ki   0.5%  11.9Ki    [__DATA_CONST]
   0.4%  10.2Ki   0.4%  10.2Ki    __TEXT,__unwind_info
   0.3%  6.57Ki   0.3%  6.57Ki    Function Start Addresses
   0.2%  4.93Ki   0.2%  4.93Ki    __TEXT,__cstring
   0.1%  2.84Ki   0.1%  2.84Ki    __DATA,__data
   0.1%  1.70Ki   0.1%  1.66Ki    [Mach-O Headers]
   0.1%  1.59Ki   0.1%  1.59Ki    __DATA_CONST,__got
   0.0%       0   0.1%  1.45Ki    __DATA,__common
   0.1%  1.41Ki   0.1%  1.41Ki    Indirect Symbol Table
   0.0%     954   0.0%     954    __TEXT,__stubs
   0.0%     496   0.0%     496    [2 Others]
 100.0%  2.37Mi 100.0%  2.39Mi    TOTAL

$ bloaty build/default/lib/libc++.a
    FILE SIZE        VM SIZE
 --------------  --------------
 100.0%  7.43Mi   NAN%       0    [AR Non-ELF Member File]
   0.0%  2.82Ki   NAN%       0    [AR Headers]
 100.0%  7.44Mi 100.0%       0    TOTAL
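
As a rough sanity check on those numbers, the __DATA,__data growth
above (1.71Ki -> 2.84Ki, about 1.1 KiB) is in the ballpark of what 16
extra mutexes cost, assuming the table is backed by pthread mutexes as
it is with the pthread threading API (sizes are platform-dependent;
pthread_mutex_t is 64 bytes on Darwin):

#include <cstdio>
#include <pthread.h>

int main() {
    // 32 - 16 = 16 additional mutexes in the static table. On Darwin this
    // is 16 * 64 = 1024 bytes, in the same ballpark as the __DATA,__data
    // delta reported by bloaty above.
    std::printf("extra bytes: %zu\n", 16 * sizeof(pthread_mutex_t));
}
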
ldionne accepted this revision as: Restricted Project. Jun 13 2022, 8:48 AM

I'll ship this now based on the analysis that we did -- if someone has comments/objections/suggestions, please let me know and we can change things post-commit.

This revision was not accepted when it landed; it landed in state Needs Review. Jun 13 2022, 8:48 AM
This revision was landed with ongoing or failed builds.
This revision was automatically updated to reflect the committed changes.