Page MenuHomePhabricator

Implementation of C++20's P1135R6 for libcxx
ClosedPublic

Authored by __simt__ on Oct 4 2019, 1:11 PM.

Details

Summary

This is the first review of this code so there's a lot to look at, and I'm fully expecting a lot of changes will be requested.

This patch contains...

A) Changes to __threading_support that introduce:

  1. Low-level semaphores on POSIX and Apple GCD, and futex on Linux.
  2. Declarations for a sharded table of contention state in the dylib, used by atomic::wait.
  3. Declarations for a thread_local variable in the dylib, used by barriers.

B) The <atomic> changes from P1135:

  1. High QoI: multi-layered back-off, using either/both the state from 1b and futexes from 1a.
  2. Low QoI: exponential time back-off, using chrono only.

C) Barrier:

  1. High QoI: a tree barrier, using the acceleration state in 1c to amortize the extra round.
  2. Low QoI: a central barrier, with a specialization for the empty completion function.

D) Semaphore:

  1. All QoI: a general template semaphore for very large ptrdiff_t values.
  2. High QoI: a specialization for “reasonable” ptrdiff_t values, using semaphores in 1a and acceleration atomics.
  3. Low QoI: a specialization for unit count.

E) Latch:

  1. All QoI (low): a central latch. (If there's a desire to see it, I could borrow the same QoI knobs from barrier, but sizeof() would grow a lot.)

F) The first basic tests for each facility from P1135.

G) Miscellaneous tweaks I needed to get this to build as libcu++ (the CUDA variant). We can drop these or you can take them as improvements. One of them is a legit macro bug.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
This revision now requires changes to proceed.Nov 18 2019, 1:15 PM
__simt__ marked 6 inline comments as done.Nov 18 2019, 8:20 PM
__simt__ added inline comments.
libcxx/include/atomic
1736

Yes, I can do that.

1757

Ack.

libcxx/include/chrono
1571 ↗(On Diff #226990)

Timed wait functions need <chrono>. That meant that I had to port <chrono> to CUDA as part of this effort. In CUDA you can't ask the OS for the time, you need to ask the silicon for the time. I followed the precedent of external threading to implement this as external clocks. Makes sense?

libcxx/include/semaphore
78

NP, Eric had requested something here. Feel free to ask for comments elsewhere.

libcxx/include/type_traits
4032 ↗(On Diff #226990)

It's important for the CUDA port that there not be naked "namespace std { ... }" that don't use the macros, because we need to inject our namespace name. I tried to keep the drive-by to a minimum.

Thank you for all the work you've put into this patch. Here are a few more comments. Still working through all this code :)

I'm becoming increasingly worried about the amount of inline configuration and platform-specific code. I want to make sure we have tests for all of these (and ideally CI that can run all the tests).

libcxx/src/atomic.cpp
29

For what platforms is _LIBCPP_HAS_PLATFORM_WAIT_STATE false, and have you tested on those platforms? I'm worried that there might be compiler errors.

33

What's the point of this macro, ATOMIC_VAR_INIT (I realize you didn't add it, but I'm still curious)?

47

This is different if _LIBCPP_HAS_PLATFORM_WAIT_STATE is false, right?

130

Will this ever be true?

134

same as below

140

Is there a test for this case? What will be the effect of __s not getting updated?

149

Instead of having void pointers that are casted, I don't see any reason these couldn't be defined as their actual types (__cxx_atomic_impl* and __libcpp_platform_contention_t*).

163

Is __libcpp_platform_wait defined on non-linux machines?

libcxx/src/barrier.cpp
45

Will this ever not only happen on the first iteration (if so, move it out of the loop maybe)?

libcxx/src/semaphore.cpp
27 ↗(On Diff #226990)

Why is this only needed for apple?

123 ↗(On Diff #226990)

Where does 50 come from? Maybe make this a macro.

152 ↗(On Diff #226990)

Is this the same as while (__old == 0)?

__simt__ marked 24 inline comments as done.Jan 9 2020, 1:55 PM

No problem. I just want to know what I need to do in the next patch in order to move forward.

libcxx/src/atomic.cpp
29

It's false on CUDA. It would be false on platforms that can't rely on OS support for efficient waiting and have to fall back to polling with backoff.

33

Prior to P0883 merging into C++20, atomics are constructed in an uninitialized state. You're supposed to use this macro to give it a static initialization. This macro is deprecated after C++20.

47

If you have a platform wait state, then it's used by this facility.

If you don't have a platform wait state, then a condvar is used instead.

130

Oh yes. Every time that there is a contending waiter concurrent with this notifier. This condition is true whenever the facility's use is not trivial.

134

Answered below.

140

Undefined behavior. Usually a segfault.

149

This is the uglier part of the patch. We have this situation where one platform (Apple) is about to go from not having platform wait states, to having them. This API is trying to be able to take either kind of efficient wait structure in the application OR in the dylib, and provide efficient waiting either way.

I think it would not be unreasonable if you guys asked me to cut this part out in order to get a first commit of the facility, and then work in the background with Louis on recovering the capability in some way, with or without a public patch.

163

Yes.

libcxx/src/barrier.cpp
45

This needs to be inside the loop unfortunately. During the first round, we need to record our effective start (leaf) location in the tree. I could express it differently - record it only at the end of the first round - but it would remain in the loop nest.

What we could potentially do is dumb down this favorite barrier index concept, or remove it entirely. It's worth a relatively small amount of performance by comparison to using the tree barrier in the first place.

I would not be offended if you asked me to give you an ordered list of things I could delete and tell you what it costs you to delete them, approximately. Then we could stop where we are more comfortable.

libcxx/src/semaphore.cpp
27 ↗(On Diff #226990)

Because of how GCD semaphores work, unfortunately.

We could delete this by sending all Apple semaphores to the generic template based on atomics.

Last I spoke with Louis, we thought that would be acceptable.

123 ↗(On Diff #226990)

Yeah. Aren't there enough macros though? I think I might come back later with a different patch to propose a set of macros to configure back-offs.

152 ↗(On Diff #226990)

It's not. This is masking the lower 32-bits of a 64-bit value, and then comparing that with 0.

zoecarver added inline comments.Jan 10 2020, 2:54 PM
libcxx/src/atomic.cpp
33

Heh. I didn't realize that was part of the standard (I was wondering why it wasn't mangled). Good to know.

47

I see. I got confused for a second while trying to follow the #ifs.

140

Isn't that a problem if _LIBCPP_HAS_NO_PLATFORM_WAIT_TABLE isn't defined?

149

I'll defer to others on this but, (assuming it wouldn't be much more work for you) it might be better to remove that from this patch and add it as a follow-up patch.

In general, the less code in this patch, the faster it can get committed.

libcxx/src/barrier.cpp
45

What we could potentially do is dumb down this favorite barrier index concept, or remove it entirely. It's worth a relatively small amount of performance by comparison to using the tree barrier in the first place.

I don't know enough about this piece of code or potential methods of implementation to comment on what we should do. I'll defer to you/others on this.

libcxx/src/semaphore.cpp
123 ↗(On Diff #226990)

Yes, there are an unfortunate number of macros in this bit of code.

I don't feel strongly about adding a macro now or later, maybe add a comment, though (that the number isn't magic or referenced elsewhere).

152 ↗(On Diff #226990)

I'm pretty sure they are the same. Look at this example. The first and last functions generate the same optimized assembly.

__simt__ updated this revision to Diff 240389.Jan 25 2020, 10:34 AM
__simt__ marked 11 inline comments as done.

A lot of changes in this patch, mostly simplifications based on the areas that got feedback about complexity.

For the most part, I have deleted code:

  1. I deleted the condvar-based implementation of atomic_wait because no current platform would make use of it. That also allowed me to delete much of the complicated ABI tricks added in the last revision. That removed several configuration macros.
  2. After discussing it with Louis, I switched the Apple config to use atomics-based semaphores. Not only did that delete a lot of code on its own, but it made some other optimizations in semaphores no longer needed, so I deleted them too. That removed several more configuration macros as well.
  3. I deleted code dealing with the thread_local optimization in barrier. Using this_thread::get::id to inject a thread-stable search seed for the algorithm is about just as good, and everything for that already exists.
  4. Latches, barriers and semaphores make more direct use of the wait/notify functionality now. Where they had "exceptions" with customized waiting, now there's a unified interface there, which deletes code.

In addition to this, the internal abstractions are much better now. I particularly like how natively-supported Futex types pass-through the intermediate layer now, it just works.

I hope you find this version is close enough that we might soon switch to delta patches on top, in trunk.

Cheers,

Olivier

jfb added a subscriber: MadCoder.Mon, Jan 27, 10:42 AM

Not finished looking through this but, here are some comments.

Overall, I really like this change. It looks fantastic. It seems like a much cleaner/simpler implementation. Thanks for all the time you've spent on this.

libcxx/include/__threading_support
29

Are all the platforms we support guaranteed to have this header (they very well might, I just don't know)?

31

Is this macro still needed?

33

This could be an else block, no?

74

Are there any platforms where SEM_VALUE_MAX doesn't exist? Maybe some BSD platform? Could you check that it is defined (or maybe that we're on a certain platform)?

libcxx/include/atomic
2445

I might just be missing it but, is this in the standard? Otherwise, could you mangle it?

2446

nit: space between the equals.

libcxx/include/barrier
275

Will __barrier_base be defined if _LIBCPP_HAS_NO_TREE_BARRIER isn't?

libcxx/include/latch
54

Can we make it so that these headers just don't exist before C++11 via cmake? That might be a nicer way to fail.

__simt__ marked 10 inline comments as done.Mon, Jan 27, 1:02 PM

Thanks Zoe!

Some responses ->

libcxx/include/__threading_support
29

This header is part of Pthreads and this #include is in the Pthreads section of the header, so I think the answer is yes. The Apple case is special in that they have deprecated this header.

31

Yes. We need a way to disable the platform native semaphores. Both Apple (in the current design) and CUDA need it.

33

It would have to be an #elif !defined(...) so that CUDA could trigger it too.

74

Like the header, it's mandated by Pthreads. It's conceivable that some BSD is non-conforming and we still want to work on it, but I don't have a BSD system handy to try.

libcxx/include/atomic
2445

It is, in C++20.

2446

OK.

libcxx/include/barrier
275

Yes. The macro ends up selecting between two different base classes -- one is the scalable tree barrier, the other is the simpler central barrier.

libcxx/include/latch
54

I don't have this skill. :^/

But I do think that we want to support them in C++11/14/17. This author's production team is going to present it to users in these dialects, for instance.

This is looking pretty good, only a few nitpicks. Once the nitpicks are addressed we can commit this and then go incrementally.

libcxx/include/__threading_support
34

Any reason to introduce this macro instead of just use !defined(_LIBCPP_NO_NATIVE_SEMAPHORES) when you need it?

510

static constexpr?

libcxx/include/atomic
550

Just wondering: will this in any way make it harder to support <atomic> in freestanding?

1465

I'm not seeing this used anywhere -- am I missing something?

libcxx/include/semaphore
62

Olivier and I spoke offline, and it's reasonable to request C++14 at least here. Please do this in all the headers you're adding.

The idea is to avoid having headers that are supported in older dialects than they need to be (within reason), which is often a source of technical debt.

libcxx/src/atomic.cpp
48

In apple_availability.h, add:

#if defined(__ENVIRONMENT_MAC_OS_X_VERSION_MIN_REQUIRED__)
#if __ENVIRONMENT_MAC_OS_X_VERSION_MIN_REQUIRED__ >= 101500
#define _LIBCPP_USE_ULOCK
#endif
#elif defined(__ENVIRONMENT_IPHONE_OS_VERSION_MIN_REQUIRED__)
#if __ENVIRONMENT_IPHONE_OS_VERSION_MIN_REQUIRED__ >= 130000
#define _LIBCPP_USE_ULOCK
#endif
#elif defined(__ENVIRONMENT_TV_OS_VERSION_MIN_REQUIRED__)
#if __ENVIRONMENT_TV_OS_VERSION_MIN_REQUIRED__ >= 130000
#define _LIBCPP_USE_ULOCK
#endif
#elif defined(__ENVIRONMENT_WATCH_OS_VERSION_MIN_REQUIRED__)
#if __ENVIRONMENT_WATCH_OS_VERSION_MIN_REQUIRED__ >= 60000
#define _LIBCPP_USE_ULOCK
#endif
#endif // __ENVIRONMENT_.*_VERSION_MIN_REQUIRED__

That should do it for all platforms aligned to Mac OS 10.15. Feel free to use whatever name for _LIBCPP_USE_ULOCK -- _LIBCPP_USE_APPLE_ULOCK probably makes the most sense since it's an Apple-specific API.

Then, your #elif becomes #elif defined(__APPLE__) && defined(_LIBCPP_USE_APPLE_ULOCK).

__simt__ marked 11 inline comments as done.Fri, Feb 14, 8:35 AM

Replies. Also noting that I will include an update to the implementation status HTML page.

libcxx/include/__threading_support
34

I was going to say it’s more convenient for me, but I’m not even convinced.

I can streamline this.

510

Yes.

libcxx/include/atomic
550

A bit, yeah. There are a few ways to proceed.

1465

I intend to use that to implement the normative encouragement that atomic_signed/unsigned[...]_t should be the efficient ones for waiting. I added those types below, but they don’t follow that encouragement.

libcxx/include/semaphore
62

Yep

libcxx/src/atomic.cpp
48

Thanks!

__simt__ updated this revision to Diff 245055.Mon, Feb 17, 4:21 PM

This revision addresses the outstanding comments, and incorporates Louis' recommended macros for ulock detection.

I also added missing max() members to barrier/latch that were added by NB comment resolution, and I implemented the normative encouragement for the lock-free types to match the ideal contention type, if there is one. I added tests for these things.

Finally, I further improved the ABI resilience of the barrier type so it's almost a full pimpl now.

ldionne accepted this revision.Tue, Feb 18, 7:11 AM

Thanks a lot for all the work and patience, Olivier. I think we're good to go now. Do you have commit access?

Also, can you please confirm that you meant to introduce all the following new symbols in the dylib, and nothing else:

std::__1::__libcpp_atomic_wait(std::__1::__cxx_atomic_impl<long long, std::__1::__cxx_atomic_base_impl<long long> > const volatile*, long long)
std::__1::__libcpp_atomic_wait(void const volatile*, long long)
std::__1::__cxx_atomic_notify_all(std::__1::__cxx_atomic_impl<long long, std::__1::__cxx_atomic_base_impl<long long> > const volatile*)
std::__1::__cxx_atomic_notify_all(void const volatile*)
std::__1::__cxx_atomic_notify_one(std::__1::__cxx_atomic_impl<long long, std::__1::__cxx_atomic_base_impl<long long> > const volatile*)
std::__1::__cxx_atomic_notify_one(void const volatile*)
std::__1::__libcpp_atomic_monitor(std::__1::__cxx_atomic_impl<long long, std::__1::__cxx_atomic_base_impl<long long> > const volatile*)
std::__1::__libcpp_atomic_monitor(void const volatile*)
std::__1::__arrive_barrier_algorithm_base(std::__1::__barrier_algorithm_base*, unsigned char)
std::__1::__destroy_barrier_algorithm_base(std::__1::__barrier_algorithm_base*)
std::__1::__construct_barrier_algorithm_base(long&)

Those will have to be added to the ABI list file (I can do that).

This revision is now accepted and ready to land.Tue, Feb 18, 7:11 AM
This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. · View Herald TranscriptMon, Feb 24, 8:03 AM
ldionne added inline comments.Mon, Feb 24, 8:50 AM
libcxx/include/atomic
443

We forgot to update the synopsis for this header with the C++20 Synchronization Library. @__simt__ would you be willing to do that in a followup change?

teemperor added inline comments.
libcxx/include/module.modulemap
234

Maybe I'm missing something here, but due to this submodule we are always parsing the barrier header even when building the module with a language standard < C++14. This means that everyone using C++11 is no longer able to use the 'std' Clang module after this commit. Is this intentional?

ldionne added inline comments.Mon, Feb 24, 10:50 AM
libcxx/include/module.modulemap
234

No, this is not intentional. Sorry, there were several failures that needed fixing after committing this (failures that were impossible to notice without throwing the change at all the build bots) -- we're getting there.

davide added a subscriber: davide.Mon, Feb 24, 10:58 AM

This is the second time something like this happened. When you make such large scale changes -- make sure to run the lldb testsuite, or be ready to revert. We already had this discussion in the past, but clearly it didn't prevent the problem from happening again.

teemperor added inline comments.Mon, Feb 24, 11:22 AM
libcxx/include/module.modulemap
234

No worries, pushed a fix in b61e83eb0e31c1e6006569b43bb98a61ff44ca4c

This is the second time something like this happened. When you make such large scale changes -- make sure to run the lldb testsuite, or be ready to revert. We already had this discussion in the past, but clearly it didn't prevent the problem from happening again.

Last time's discussion didn't get anywhere, as running the LLDB test suite on each commit we make to libc++ isn't a viable option.

Libc++'s test matrix is insanely large, and actually we can't even see all of it cause some uses are behind closed doors. Whenever we make a non-trivial change (and BTW this is a purely additive change), it breaks someone somewhere. And you can rest assured that we do run a lot of testing locally to make sure we don't break people before committing, but it can't catch everything. That's just the way it is, and we try to fix it as quickly as possible -- I've spent all day so far trying to fix the consequences of applying this patch. This is not a lack of diligence, it's just that the nature of libc++ makes it difficult to test comprehensively.

I don't know what this commit broke in LLDB, if anything, but instead it's more useful to comment here with a link to the failure so we can help fix it. Being combative is no help, as we're all in the same boat.

This is the second time something like this happened. When you make such large scale changes -- make sure to run the lldb testsuite, or be ready to revert. We already had this discussion in the past, but clearly it didn't prevent the problem from happening again.

Last time's discussion didn't get anywhere, as running the LLDB test suite on each commit we make to libc++ isn't a viable option.

Libc++'s test matrix is insanely large, and actually we can't even see all of it cause some uses are behind closed doors. Whenever we make a non-trivial change (and BTW this is a purely additive change), it breaks someone somewhere. And you can rest assured that we do run a lot of testing locally to make sure we don't break people before committing, but it can't catch everything. That's just the way it is, and we try to fix it as quickly as possible -- I've spent all day so far trying to fix the consequences of applying this patch. This is not a lack of diligence, it's just that the nature of libc++ makes it difficult to test comprehensively.

I don't know what this commit broke in LLDB, if anything, but instead it's more useful to comment here with a link to the failure so we can help fix it. Being combative is no help, as we're all in the same boat.

I would like to reiterate that the policy in LLVM is that commit that break projects can be reverted willy-nilly.
I would also like to stress that my use cases are not behind closed doors -- LLDB is part of the LLVM umbrella.
I thought we reached an agreement last time, but looks like there was a miscommunication or misunderstanding on your side, so let me reiterate that you have two options:

  1. Run the lldb testsuite for changes that impact layout of structures -- or in any case you consider non-trivial, or if you don't think you can do this, at least ping somebody from lldb to take a look at the changes before they're committed. This of course requires some judgement on your side. When in doubt ask, as sending an e-mail is cheap.
  2. You commit without pre-commit checking lldb because you consider the additive cost of running ninja check-lldb is prohibitive. This is your choice, but don't be surprised if people will revert your commit if it breaks things.

Hopefully this clarifies my position.