Page MenuHomePhabricator

Inline hot functions in libcxx shared_ptr implementation.
ClosedPublic

Authored by hxy9243 on Sep 27 2016, 3:51 PM.

Details

Summary

This patch moves some existing functions from the memory.cpp to the memory header file, so that they could be properly inlined, which gives potential optimization opportunities and performance benefits.

Diff Detail

Repository
rL LLVM

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
hxy9243 updated this revision to Diff 72725.Sep 27 2016, 3:51 PM
hxy9243 set the repository for this revision to rL LLVM.
halyavin added inline comments.
libcxx/include/atomic_support.h
1 ↗(On Diff #72725)

Non-standard include files in the main include directory must start with __ to avoid collisions with application headers.

1 ↗(On Diff #72725)

Does anyone know why this header exists and atomic header can't be used instead?

hxy9243 updated this revision to Diff 73293.Oct 3 2016, 9:28 AM

Addresses comments from @halyavin, rename "atomic_support.h" to "__atomic_support" to avoid collisions with application headers.

mclow.lists edited edge metadata.Oct 9 2016, 3:07 PM

How does this play with existing binaries? Applications that expect these functions to exist in the dylib?

EricWF added a subscriber: EricWF.Oct 9 2016, 11:00 PM

How does this play with existing binaries? Applications that expect these functions to exist in the dylib?

This patch is majorly ABI breaking, although we could probably find a formulation that wasn't.

@hxy9243 wrote:
which gives potential optimization opportunities and performance benefits.

Please provide benchmark tests which demonstrate that these benefits are concrete and not just "potential". Moving methods out of the dylib is no easy task so I would like hard evidence that it's worth while.

libcxx/include/memory
3793 ↗(On Diff #73293)

Anonymous namespaces are a C++11 feature and this is C++03 code.

3798 ↗(On Diff #73293)
  • T and increment need to be reserved names.
  • Never use __attribute__((always_inline)) directly, that's why we have visibility macros.
3802 ↗(On Diff #73293)

Why add increment and decrement at all? Just manually inline __libcpp_atomic_add at the callsites.

3818 ↗(On Diff #73293)

Why would you want to inline this?

sebpop edited edge metadata.Oct 10 2016, 6:46 AM

Please provide benchmark tests which demonstrate that these benefits are concrete and not just "potential". Moving methods out of the dylib is no easy task so I would like hard evidence that it's worth while.

With this patch we have seen the score of a proprietary benchmark going up by 20%, matching the performance we see with LLVM + libstdc++.
We will provide a testcase that shows the performance uplift.

Please provide benchmark tests which demonstrate that these benefits are concrete and not just "potential". Moving methods out of the dylib is no easy task so I would like hard evidence that it's worth while.

With this patch we have seen the score of a proprietary benchmark going up by 20%, matching the performance we see with LLVM + libstdc++.
We will provide a testcase that shows the performance uplift.

20% sounds amazing! Thanks for working on this.

Thanks for pointing out. It's true that it may cause ABI breakage. It would be nice to keep compatibility while getting the performance benefits from inlining.

I've tested the patch with google-benchmark/util_smartptr_libcxx shipped with libcxx on x86_64 server, and attached the results as following:

BASE libcxx r283113:
$   taskset -c 23 ./util_smartptr.libcxx.out
Run on (24 X 1200 MHz CPU s)
2016-10-12 13:52:03
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
Benchmark                          Time           CPU Iterations
----------------------------------------------------------------
BM_SharedPtrCreateDestroy         54 ns         54 ns   12388755
BM_SharedPtrIncDecRef             37 ns         37 ns   19021739
BM_WeakPtrIncDecRef               38 ns         38 ns   18421053
 


libcxx with patch:
$   taskset -c 23 ./util_smartptr.libcxx.out
Run on (24 X 1200 MHz CPU s)
2016-10-12 13:48:38
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
Benchmark                          Time           CPU Iterations
----------------------------------------------------------------
BM_SharedPtrCreateDestroy         44 ns         44 ns   14730639
BM_SharedPtrIncDecRef             18 ns         18 ns   38888889
BM_WeakPtrIncDecRef               30 ns         30 ns   23648649

How does this play with existing binaries? Applications that expect these functions to exist in the dylib?

This patch is majorly ABI breaking, although we could probably find a formulation that wasn't.

Eric, Marshall,
any suggestions on how to fix the backwards compatibility issue?

Thanks!

hiraditya edited edge metadata.Oct 15 2016, 10:26 AM

Marshall suggests using macro as we discussed offline. For some reason the reply does not appear here: http://lists.llvm.org/pipermail/cfe-commits/Week-of-Mon-20161010/173780.html

Marshall suggests using macro as we discussed offline. For some reason the reply does not appear here: http://lists.llvm.org/pipermail/cfe-commits/Week-of-Mon-20161010/173780.html

Ping.

sebpop commandeered this revision.Oct 26 2016, 9:36 AM
sebpop edited reviewers, added: hxy9243; removed: sebpop.
sebpop updated this revision to Diff 75908.Oct 26 2016, 9:42 AM
sebpop edited edge metadata.
sebpop removed rL LLVM as the repository for this revision.

The patch also implements the idea that Marshall proposed in:
http://lists.llvm.org/pipermail/cfe-commits/Week-of-Mon-20161010/173780.html

I have an idea; it involves a macro that is sometimes "inline" and
sometimes not, and changes when you're building the library vs. when you're
just including the headers.

Tested on x86_64-linux.
The symbols for the functions that are now inlined still appear in the libc++.so.

Ok to commit?

sebpop marked 2 inline comments as done.Oct 26 2016, 9:48 AM
sebpop added inline comments.
libcxx/include/memory
3802 ↗(On Diff #73293)

I like the idea to manually inline the inc and dec functions.
What should we do with the NOTE: above?

NOTE: Relaxed and acq/rel atomics (for increment and decrement respectively)
should be sufficient for thread safety.
// See https://llvm.org/bugs/show_bug.cgi?id=22803

should we just go ahead and remove the note, or you want to have it where inc/dec are called? (about a dozen places.)

hxy9243 accepted this revision.Oct 31 2016, 12:35 PM
hxy9243 edited edge metadata.

Looks good to me.
Notice that the performance gain can only be observed when compiled with the updated C++ header files.

This revision is now accepted and ready to land.Oct 31 2016, 12:35 PM

Ping: Eric, Marshall, could you please approve or comment on this patch?
Thanks!

Just a question: TSan intercepts on the dylib functions, namely __release_shared, to track the atomic accesses. Can you make sure this doesn't break? There's a few testcases for this in compiler-rt.

sebpop added a comment.Nov 2 2016, 1:08 PM
In D24991#583877, @kubabrecka wrote:

Just a question: TSan intercepts on the dylib functions, namely __release_shared, to track the atomic accesses. Can you make sure this doesn't break? There's a few testcases for this in compiler-rt.

I just ran ninja check-all with and without this patch and there are no regressions in compiler-rt on an x86_64-linux machine.

I just ran ninja check-all with and without this patch and there are no regressions in compiler-rt on an x86_64-linux machine.

The TSan interceptors (and testcases) are Darwin-only at this point. I'll run the tests on my machine.

In D24991#586248, @kubabrecka wrote:

I just ran ninja check-all with and without this patch and there are no regressions in compiler-rt on an x86_64-linux machine.

The TSan interceptors (and testcases) are Darwin-only at this point. I'll run the tests on my machine.

Any updates on the testing? Thanks very much!

This passes TSan tests on Darwin. LGTM.

sebpop added a comment.Nov 8 2016, 6:28 AM

@mclow.lists, @EricWF, ok to commit the patch?

Thanks,
Sebastian

Ping. @mclow.lists, @EricWF, any ideas on this patch?
Thanks very much!

@mclow.lists could you please have a last look at this patch: the change is for a performance improvement (20% uplift on a proprietary benchmark), and all the issues mentioned in the review have been addressed.
The existing synthetic benchmark shows an overall improvement:

master:
Benchmark                          Time           CPU Iterations
----------------------------------------------------------------
BM_SharedPtrCreateDestroy         54 ns         54 ns   12388755
BM_SharedPtrIncDecRef             37 ns         37 ns   19021739
BM_WeakPtrIncDecRef               38 ns         38 ns   18421053
 
master + patch:
Benchmark                          Time           CPU Iterations
----------------------------------------------------------------
BM_SharedPtrCreateDestroy         44 ns         44 ns   14730639
BM_SharedPtrIncDecRef             18 ns         18 ns   38888889
BM_WeakPtrIncDecRef               30 ns         30 ns   23648649
EricWF requested changes to this revision.Dec 30 2016, 2:47 AM
EricWF added a reviewer: EricWF.

Added a bunch of inline comments.

The biggest requested change is removing the __atomic_support header. We only need one atomic call within the headers. It's overkill to add a new header.

libcxx/include/__atomic_support
1 ↗(On Diff #75908)

I would greatly prefer if this patch didn't add another header, and simply defined __libcpp_atomic_increment and __libcpp_atomic_decrement in place of __atomic_inc_dec::increment/__atomic_inc_dec::decrement.

libcxx/include/memory
3911 ↗(On Diff #75908)

Please apply _LIBCPP_FUNC_VIS to both of these methods.

3914 ↗(On Diff #75908)

the inline in redundant if you define the function inside the class.

3948 ↗(On Diff #75908)

Please add _LIBCPP_FUNC_VIS to the three methods.

3952 ↗(On Diff #75908)

inline is redundant here.

3802 ↗(On Diff #73293)

Neremind about the manually inlining bit.

Please remove the __atomic_inc_dec namespace and rename increment to __libcpp_atomic_increment and decrement to __libcpp_atomic_decrement.

Please also remove the __atomic_support header and instead simply call __atomic_add_fetch from inside the functions.

This revision now requires changes to proceed.Dec 30 2016, 2:47 AM
hxy9243 commandeered this revision.Jan 3 2017, 2:54 PM
hxy9243 edited reviewers, added: sebpop; removed: hxy9243.
hxy9243 updated this revision to Diff 82958.Jan 3 2017, 2:59 PM
hxy9243 edited edge metadata.
hxy9243 set the repository for this revision to rL LLVM.

Move the header back in its place, and only copy over necessary parts. Now call __atomic_add_fetch from inside the functions.

hxy9243 updated this revision to Diff 82962.Jan 3 2017, 3:04 PM
hxy9243 edited edge metadata.
hxy9243 marked 4 inline comments as done.

Minor fix, remove redundant inlines.

hxy9243 marked an inline comment as done.Jan 10 2017, 2:40 PM

Addressed previous issues in the comments. The patch still shows consistent perf uplift in proprietary benchmark on shared_ptr.

@EricWF @sebpop @hiraditya Any thoughts?

kubamracek added inline comments.Jan 10 2017, 5:03 PM
libcxx/include/memory
3702 ↗(On Diff #82962)

I don't think this should be named __libcpp_atomic_increment, because it uses relaxed ordering and thus it's not a generic increment (same goes for decrement). Could we rename this to __libcpp_atomic_refcount_increment or something similar? That would suggest why we're using acq+rel on one side and relaxed on other side. Using these functions for non-refcount purposes will be wrong and the current names (__libcpp_atomic_increment) suggest that they're doing generic atomic operations.

hxy9243 updated this revision to Diff 84004.EditedJan 11 2017, 11:32 AM

Adresses comments from @kubabrecka: minor changes on function names. Rename __libcpp_atomic_* to __libcpp_atomic_refcount_*.

EricWF requested changes to this revision.Jan 14 2017, 3:22 AM
EricWF edited edge metadata.

Almost LGTM. Just a couple of inline comments left. Thanks for working on this!

libcxx/include/memory
3691 ↗(On Diff #84004)
  1. I would reduce these checks down to only what we need in the headers.
  2. I would rename _LIBCPP_HAS_ATOMIC_BUILTINS to _LIBCPP_HAS_BUILTIN_ATOMIC_SUPPORT so it doesn't conflict with atomic_support.h and so we don't get it confused with all of the other _LIBCPP_ATOMIC configuration macros.
  3. This is missing the configuration checks for GCC. Specifically #elif !defined(__clang__) && defined(_GNUC_VER) && _GNUC_VER >= 407
3701 ↗(On Diff #84004)

inline _LIBCPP_INLINE_VISIBILITY

3759 ↗(On Diff #84004)

_LIBCPP_FUNC_VIS goes before the return type.

3797 ↗(On Diff #84004)

_LIBCPP_FUNC_VIS goes before the return type.

This revision now requires changes to proceed.Jan 14 2017, 3:22 AM
hxy9243 updated this revision to Diff 84565.Jan 16 2017, 8:16 AM
hxy9243 edited edge metadata.

Addresses comments from @EricWF.

Thanks for reviewing, I know it takes a lot of energy. It helped me learn a lot.

hxy9243 marked 3 inline comments as done.Jan 16 2017, 8:16 AM
mclow.lists added inline comments.Jan 16 2017, 8:31 AM
libcxx/include/memory
3700 ↗(On Diff #84565)

template <class _Tp>, please.

Otherwise when some client code does #define T true (yes, I've seen that!) this breaks. _Tp is a reserved identifier, and if they use that, we can point at them and laugh.

3702 ↗(On Diff #84565)

The parameter name needs to be reserved as well. __t, please.

3711 ↗(On Diff #84565)

Same comment as L3700

3713 ↗(On Diff #84565)

Same comment as L3702

hxy9243 updated this revision to Diff 84569.Jan 16 2017, 9:03 AM
hxy9243 edited edge metadata.

Addresses comments from @mclow.lists.

hxy9243 marked 6 inline comments as done.Jan 16 2017, 9:05 AM
EricWF accepted this revision.Jan 16 2017, 12:05 PM
EricWF edited edge metadata.

LGTM.

This revision is now accepted and ready to land.Jan 16 2017, 12:05 PM
This revision was automatically updated to reflect the committed changes.