This is an archive of the discontinued LLVM Phabricator instance.

[libc++] Make std::mutex use ADAPTIVE_NP intializer if available on gnu
Needs RevisionPublic

Authored by goldstein.w.n on Apr 18 2023, 7:56 PM.

Details

Reviewers
ldionne
Mordante
philnik
Group Reviewers
Restricted Project
Summary

The most performant pthread_mutex_t configuration in GLIBC is
PTHREAD_MUTEX_ADAPTIVE_NP.

The default is timed mutex which is implemented almost entirely
through the futex syscall (no attempt at looping for any amount of
time in userland). This adds has a high context switch cost in almost
all situations.

The ADAPTIVE_NP implementation, on the other hand, progresses through
exponential backoff before going to the futex allowing for lower
latency for shorter waits with extra minimal overhead or memory
traffic for longer waits.

On my Icelake machine running the GLIBC mutex benchmarks suite for N=5
runs.

There is a 12% performance improvement for low contention cases
(nthreads <= nprocs / 2) and a 2.5% performance improvement in high
contention cases (nthreads > nprocs / 2)

Diff Detail

Event Timeline

goldstein.w.n created this revision.Apr 18 2023, 7:56 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 18 2023, 7:56 PM
goldstein.w.n requested review of this revision.Apr 18 2023, 7:56 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 18 2023, 7:56 PM
Herald added a reviewer: Restricted Project. · View Herald Transcript

Here are the full results for GLIBC benchmark suite:

                                                         Lower=better
Benchmark Config                                        ,Speedup (ADAPTIVE_NP / TIMED_NP)
type=adaptive/non_crt_len=1/crt_len=0/threads=2         ,0.605
type=adaptive/non_crt_len=1/crt_len=0/threads=4         ,0.861
type=adaptive/non_crt_len=1/crt_len=0/threads=8         ,1.140
type=adaptive/non_crt_len=1/crt_len=0/threads=16        ,1.161
type=adaptive/non_crt_len=1/crt_len=0/threads=20        ,1.180
type=adaptive/non_crt_len=1/crt_len=1/threads=2         ,0.658
type=adaptive/non_crt_len=1/crt_len=1/threads=4         ,0.830
type=adaptive/non_crt_len=1/crt_len=1/threads=8         ,1.092
type=adaptive/non_crt_len=1/crt_len=1/threads=16        ,1.119
type=adaptive/non_crt_len=1/crt_len=1/threads=20        ,1.120
type=adaptive/non_crt_len=1/crt_len=2/threads=2         ,0.585
type=adaptive/non_crt_len=1/crt_len=2/threads=4         ,0.735
type=adaptive/non_crt_len=1/crt_len=2/threads=8         ,1.019
type=adaptive/non_crt_len=1/crt_len=2/threads=16        ,1.095
type=adaptive/non_crt_len=1/crt_len=2/threads=20        ,1.098
type=adaptive/non_crt_len=1/crt_len=4/threads=2         ,0.637
type=adaptive/non_crt_len=1/crt_len=4/threads=4         ,0.746
type=adaptive/non_crt_len=1/crt_len=4/threads=8         ,0.966
type=adaptive/non_crt_len=1/crt_len=4/threads=16        ,1.020
type=adaptive/non_crt_len=1/crt_len=4/threads=20        ,1.034
type=adaptive/non_crt_len=1/crt_len=8/threads=2         ,0.512
type=adaptive/non_crt_len=1/crt_len=8/threads=4         ,0.781
type=adaptive/non_crt_len=1/crt_len=8/threads=8         ,0.906
type=adaptive/non_crt_len=1/crt_len=8/threads=16        ,0.951
type=adaptive/non_crt_len=1/crt_len=8/threads=20        ,0.951
type=adaptive/non_crt_len=1/crt_len=16/threads=2        ,0.860
type=adaptive/non_crt_len=1/crt_len=16/threads=4        ,0.867
type=adaptive/non_crt_len=1/crt_len=16/threads=8        ,0.893
type=adaptive/non_crt_len=1/crt_len=16/threads=16       ,0.901
type=adaptive/non_crt_len=1/crt_len=16/threads=20       ,0.903
type=adaptive/non_crt_len=1/crt_len=32/threads=2        ,0.939
type=adaptive/non_crt_len=1/crt_len=32/threads=4        ,0.845
type=adaptive/non_crt_len=1/crt_len=32/threads=8        ,0.901
type=adaptive/non_crt_len=1/crt_len=32/threads=16       ,0.960
type=adaptive/non_crt_len=1/crt_len=32/threads=20       ,0.955
type=adaptive/non_crt_len=1/crt_len=64/threads=2        ,0.941
type=adaptive/non_crt_len=1/crt_len=64/threads=4        ,0.837
type=adaptive/non_crt_len=1/crt_len=64/threads=8        ,1.000
type=adaptive/non_crt_len=1/crt_len=64/threads=16       ,0.967
type=adaptive/non_crt_len=1/crt_len=64/threads=20       ,0.949
type=adaptive/non_crt_len=1/crt_len=128/threads=2       ,0.992
type=adaptive/non_crt_len=1/crt_len=128/threads=4       ,0.984
type=adaptive/non_crt_len=1/crt_len=128/threads=8       ,0.997
type=adaptive/non_crt_len=1/crt_len=128/threads=16      ,0.994
type=adaptive/non_crt_len=1/crt_len=128/threads=20      ,0.989
type=adaptive/non_crt_len=32/crt_len=0/threads=2        ,0.981
type=adaptive/non_crt_len=32/crt_len=0/threads=4        ,0.914
type=adaptive/non_crt_len=32/crt_len=0/threads=8        ,0.995
type=adaptive/non_crt_len=32/crt_len=0/threads=16       ,1.049
type=adaptive/non_crt_len=32/crt_len=0/threads=20       ,1.055
type=adaptive/non_crt_len=32/crt_len=1/threads=2        ,0.989
type=adaptive/non_crt_len=32/crt_len=1/threads=4        ,0.924
type=adaptive/non_crt_len=32/crt_len=1/threads=8        ,0.983
type=adaptive/non_crt_len=32/crt_len=1/threads=16       ,1.044
type=adaptive/non_crt_len=32/crt_len=1/threads=20       ,1.037
type=adaptive/non_crt_len=32/crt_len=2/threads=2        ,0.996
type=adaptive/non_crt_len=32/crt_len=2/threads=4        ,0.937
type=adaptive/non_crt_len=32/crt_len=2/threads=8        ,0.973
type=adaptive/non_crt_len=32/crt_len=2/threads=16       ,1.040
type=adaptive/non_crt_len=32/crt_len=2/threads=20       ,1.038
type=adaptive/non_crt_len=32/crt_len=4/threads=2        ,1.005
type=adaptive/non_crt_len=32/crt_len=4/threads=4        ,0.951
type=adaptive/non_crt_len=32/crt_len=4/threads=8        ,0.964
type=adaptive/non_crt_len=32/crt_len=4/threads=16       ,1.006
type=adaptive/non_crt_len=32/crt_len=4/threads=20       ,0.999
type=adaptive/non_crt_len=32/crt_len=8/threads=2        ,0.993
type=adaptive/non_crt_len=32/crt_len=8/threads=4        ,0.883
type=adaptive/non_crt_len=32/crt_len=8/threads=8        ,0.924
type=adaptive/non_crt_len=32/crt_len=8/threads=16       ,0.964
type=adaptive/non_crt_len=32/crt_len=8/threads=20       ,0.961
type=adaptive/non_crt_len=32/crt_len=16/threads=2       ,0.933
type=adaptive/non_crt_len=32/crt_len=16/threads=4       ,0.904
type=adaptive/non_crt_len=32/crt_len=16/threads=8       ,0.909
type=adaptive/non_crt_len=32/crt_len=16/threads=16      ,0.924
type=adaptive/non_crt_len=32/crt_len=16/threads=20      ,0.919
type=adaptive/non_crt_len=32/crt_len=32/threads=2       ,0.922
type=adaptive/non_crt_len=32/crt_len=32/threads=4       ,0.918
type=adaptive/non_crt_len=32/crt_len=32/threads=8       ,0.913
type=adaptive/non_crt_len=32/crt_len=32/threads=16      ,0.978
type=adaptive/non_crt_len=32/crt_len=32/threads=20      ,0.959
type=adaptive/non_crt_len=32/crt_len=64/threads=2       ,0.991
type=adaptive/non_crt_len=32/crt_len=64/threads=4       ,0.979
type=adaptive/non_crt_len=32/crt_len=64/threads=8       ,0.952
type=adaptive/non_crt_len=32/crt_len=64/threads=16      ,0.879
type=adaptive/non_crt_len=32/crt_len=64/threads=20      ,0.896
type=adaptive/non_crt_len=32/crt_len=128/threads=2      ,0.912
type=adaptive/non_crt_len=32/crt_len=128/threads=4      ,0.979
type=adaptive/non_crt_len=32/crt_len=128/threads=8      ,0.947
type=adaptive/non_crt_len=32/crt_len=128/threads=16     ,0.956
type=adaptive/non_crt_len=32/crt_len=128/threads=20     ,0.954
type=adaptive/non_crt_len=128/crt_len=0/threads=2       ,1.005
type=adaptive/non_crt_len=128/crt_len=0/threads=4       ,0.988
type=adaptive/non_crt_len=128/crt_len=0/threads=8       ,0.850
type=adaptive/non_crt_len=128/crt_len=0/threads=16      ,1.045
type=adaptive/non_crt_len=128/crt_len=0/threads=20      ,1.039
type=adaptive/non_crt_len=128/crt_len=1/threads=2       ,0.990
type=adaptive/non_crt_len=128/crt_len=1/threads=4       ,0.988
type=adaptive/non_crt_len=128/crt_len=1/threads=8       ,0.845
type=adaptive/non_crt_len=128/crt_len=1/threads=16      ,1.043
type=adaptive/non_crt_len=128/crt_len=1/threads=20      ,1.043
type=adaptive/non_crt_len=128/crt_len=2/threads=2       ,1.004
type=adaptive/non_crt_len=128/crt_len=2/threads=4       ,0.994
type=adaptive/non_crt_len=128/crt_len=2/threads=8       ,0.852
type=adaptive/non_crt_len=128/crt_len=2/threads=16      ,1.044
type=adaptive/non_crt_len=128/crt_len=2/threads=20      ,1.039
type=adaptive/non_crt_len=128/crt_len=4/threads=2       ,0.993
type=adaptive/non_crt_len=128/crt_len=4/threads=4       ,1.003
type=adaptive/non_crt_len=128/crt_len=4/threads=8       ,0.864
type=adaptive/non_crt_len=128/crt_len=4/threads=16      ,1.026
type=adaptive/non_crt_len=128/crt_len=4/threads=20      ,1.019
type=adaptive/non_crt_len=128/crt_len=8/threads=2       ,1.005
type=adaptive/non_crt_len=128/crt_len=8/threads=4       ,0.975
type=adaptive/non_crt_len=128/crt_len=8/threads=8       ,0.898
type=adaptive/non_crt_len=128/crt_len=8/threads=16      ,0.979
type=adaptive/non_crt_len=128/crt_len=8/threads=20      ,0.977
type=adaptive/non_crt_len=128/crt_len=16/threads=2      ,1.035
type=adaptive/non_crt_len=128/crt_len=16/threads=4      ,0.809
type=adaptive/non_crt_len=128/crt_len=16/threads=8      ,0.887
type=adaptive/non_crt_len=128/crt_len=16/threads=16     ,0.939
type=adaptive/non_crt_len=128/crt_len=16/threads=20     ,0.942
type=adaptive/non_crt_len=128/crt_len=32/threads=2      ,1.026
type=adaptive/non_crt_len=128/crt_len=32/threads=4      ,0.805
type=adaptive/non_crt_len=128/crt_len=32/threads=8      ,0.889
type=adaptive/non_crt_len=128/crt_len=32/threads=16     ,0.956
type=adaptive/non_crt_len=128/crt_len=32/threads=20     ,0.937
type=adaptive/non_crt_len=128/crt_len=64/threads=2      ,0.971
type=adaptive/non_crt_len=128/crt_len=64/threads=4      ,0.822
type=adaptive/non_crt_len=128/crt_len=64/threads=8      ,0.944
type=adaptive/non_crt_len=128/crt_len=64/threads=16     ,0.914
type=adaptive/non_crt_len=128/crt_len=64/threads=20     ,0.928
type=adaptive/non_crt_len=128/crt_len=128/threads=2     ,0.810
type=adaptive/non_crt_len=128/crt_len=128/threads=4     ,0.824
type=adaptive/non_crt_len=128/crt_len=128/threads=8     ,0.940
type=adaptive/non_crt_len=128/crt_len=128/threads=16    ,0.919
type=adaptive/non_crt_len=128/crt_len=128/threads=20    ,0.929
philnik requested changes to this revision.Apr 18 2023, 8:48 PM

Looking at https://stackoverflow.com/questions/19863734/what-is-pthread-mutex-adaptive-np I don't think we want to do this change. If someone needs such a special mutex, I guess they have to implement it themselves. We have to provide a useful mutex for all cases, not just for low contention cases that only have short times during which the mutex is locked.

This revision now requires changes to proceed.Apr 18 2023, 8:48 PM
goldstein.w.n added a comment.EditedApr 18 2023, 9:05 PM

Looking at https://stackoverflow.com/questions/19863734/what-is-pthread-mutex-adaptive-np I don't think we want to do this change. If someone needs such a special mutex, I guess they have to implement it themselves. We have to provide a useful mutex for all cases, not just for low contention cases that only have short times during which the mutex is locked.

I don't think the SO link really does justice to the reality of the implementation.
At the very least since the SO post was been written the following have been
added.

  1. expo backoff to avoid memory traffic
  2. Read guards on the CAS (same as above as exclusive state in mind)

It still hits the futex, just not immediately as doing so has a serious
latency cost.

Spinning ~100 times with a dozen or so memory accesses (with likely
all of them read-only) has a minimal cost in the high contention case
(its +maybe 100-200 cycles but the futex alone is 10-100x that) and
helps ALOT in the lower contention case (avoids really expensive syscalls).

Edit: To be clear. I belive ADAPTIVE_NP is the generic implementation that
does well in all cases whereas TIMED_NP is the specialized.

Looking at https://stackoverflow.com/questions/19863734/what-is-pthread-mutex-adaptive-np I don't think we want to do this change. If someone needs such a special mutex, I guess they have to implement it themselves. We have to provide a useful mutex for all cases, not just for low contention cases that only have short times during which the mutex is locked.

I don't think the SO link really does justice to the reality of the implementation.
At the very least since the SO post was been written the following have been
added.

  1. expo backoff to avoid memory traffic
  2. Read guards on the CAS (same as above as exclusive state in mind)

It still hits the futex, just not immediately as doing so has a serious
latency cost.

Spinning ~100 times with a dozen or so memory accesses (with likely
all of them read-only) has a minimal cost in the high contention case
(its +maybe 100-200 cycles but the futex alone is 10-100x that) and
helps ALOT in the lower contention case (avoids really expensive syscalls).

Edit: To be clear. I belive ADAPTIVE_NP is the generic implementation that
does well in all cases whereas TIMED_NP is the specialized.

Also FWIW, the fast both is identical for both of them. The only difference
is whether they handle contention by just going directly to futex (TIMED_NP)
or spin for an adaptive number of iterations <= 100 (ADAPTIVE_NP).