This patch moves some existing functions from the memory.cpp to the memory header file, so that they could be properly inlined, which gives potential optimization opportunities and performance benefits.
Details
Diff Detail
- Repository
- rL LLVM
Event Timeline
Addresses comments from @halyavin, rename "atomic_support.h" to "__atomic_support" to avoid collisions with application headers.
How does this play with existing binaries? Applications that expect these functions to exist in the dylib?
This patch is majorly ABI breaking, although we could probably find a formulation that wasn't.
Please provide benchmark tests which demonstrate that these benefits are concrete and not just "potential". Moving methods out of the dylib is no easy task so I would like hard evidence that it's worth while.
libcxx/include/memory | ||
---|---|---|
3793 ↗ | (On Diff #73293) | Anonymous namespaces are a C++11 feature and this is C++03 code. |
3798 ↗ | (On Diff #73293) |
|
3802 ↗ | (On Diff #73293) | Why add increment and decrement at all? Just manually inline __libcpp_atomic_add at the callsites. |
3818 ↗ | (On Diff #73293) | Why would you want to inline this? |
With this patch we have seen the score of a proprietary benchmark going up by 20%, matching the performance we see with LLVM + libstdc++.
We will provide a testcase that shows the performance uplift.
Thanks for pointing out. It's true that it may cause ABI breakage. It would be nice to keep compatibility while getting the performance benefits from inlining.
I've tested the patch with google-benchmark/util_smartptr_libcxx shipped with libcxx on x86_64 server, and attached the results as following:
BASE libcxx r283113: $ taskset -c 23 ./util_smartptr.libcxx.out Run on (24 X 1200 MHz CPU s) 2016-10-12 13:52:03 ***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead. Benchmark Time CPU Iterations ---------------------------------------------------------------- BM_SharedPtrCreateDestroy 54 ns 54 ns 12388755 BM_SharedPtrIncDecRef 37 ns 37 ns 19021739 BM_WeakPtrIncDecRef 38 ns 38 ns 18421053 libcxx with patch: $ taskset -c 23 ./util_smartptr.libcxx.out Run on (24 X 1200 MHz CPU s) 2016-10-12 13:48:38 ***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead. Benchmark Time CPU Iterations ---------------------------------------------------------------- BM_SharedPtrCreateDestroy 44 ns 44 ns 14730639 BM_SharedPtrIncDecRef 18 ns 18 ns 38888889 BM_WeakPtrIncDecRef 30 ns 30 ns 23648649
Eric, Marshall,
any suggestions on how to fix the backwards compatibility issue?
Thanks!
Marshall suggests using macro as we discussed offline. For some reason the reply does not appear here: http://lists.llvm.org/pipermail/cfe-commits/Week-of-Mon-20161010/173780.html
The patch also implements the idea that Marshall proposed in:
http://lists.llvm.org/pipermail/cfe-commits/Week-of-Mon-20161010/173780.html
I have an idea; it involves a macro that is sometimes "inline" and
sometimes not, and changes when you're building the library vs. when you're
just including the headers.
Tested on x86_64-linux.
The symbols for the functions that are now inlined still appear in the libc++.so.
Ok to commit?
libcxx/include/memory | ||
---|---|---|
3802 ↗ | (On Diff #73293) | I like the idea to manually inline the inc and dec functions. NOTE: Relaxed and acq/rel atomics (for increment and decrement respectively) should we just go ahead and remove the note, or you want to have it where inc/dec are called? (about a dozen places.) |
Looks good to me.
Notice that the performance gain can only be observed when compiled with the updated C++ header files.
Just a question: TSan intercepts on the dylib functions, namely __release_shared, to track the atomic accesses. Can you make sure this doesn't break? There's a few testcases for this in compiler-rt.
I just ran ninja check-all with and without this patch and there are no regressions in compiler-rt on an x86_64-linux machine.
The TSan interceptors (and testcases) are Darwin-only at this point. I'll run the tests on my machine.
@mclow.lists could you please have a last look at this patch: the change is for a performance improvement (20% uplift on a proprietary benchmark), and all the issues mentioned in the review have been addressed.
The existing synthetic benchmark shows an overall improvement:
master: Benchmark Time CPU Iterations ---------------------------------------------------------------- BM_SharedPtrCreateDestroy 54 ns 54 ns 12388755 BM_SharedPtrIncDecRef 37 ns 37 ns 19021739 BM_WeakPtrIncDecRef 38 ns 38 ns 18421053 master + patch: Benchmark Time CPU Iterations ---------------------------------------------------------------- BM_SharedPtrCreateDestroy 44 ns 44 ns 14730639 BM_SharedPtrIncDecRef 18 ns 18 ns 38888889 BM_WeakPtrIncDecRef 30 ns 30 ns 23648649
Added a bunch of inline comments.
The biggest requested change is removing the __atomic_support header. We only need one atomic call within the headers. It's overkill to add a new header.
libcxx/include/__atomic_support | ||
---|---|---|
1 ↗ | (On Diff #75908) | I would greatly prefer if this patch didn't add another header, and simply defined __libcpp_atomic_increment and __libcpp_atomic_decrement in place of __atomic_inc_dec::increment/__atomic_inc_dec::decrement. |
libcxx/include/memory | ||
3911 ↗ | (On Diff #75908) | Please apply _LIBCPP_FUNC_VIS to both of these methods. |
3914 ↗ | (On Diff #75908) | the inline in redundant if you define the function inside the class. |
3948 ↗ | (On Diff #75908) | Please add _LIBCPP_FUNC_VIS to the three methods. |
3952 ↗ | (On Diff #75908) | inline is redundant here. |
3802 ↗ | (On Diff #73293) | Neremind about the manually inlining bit. Please remove the __atomic_inc_dec namespace and rename increment to __libcpp_atomic_increment and decrement to __libcpp_atomic_decrement. Please also remove the __atomic_support header and instead simply call __atomic_add_fetch from inside the functions. |
Move the header back in its place, and only copy over necessary parts. Now call __atomic_add_fetch from inside the functions.
Addressed previous issues in the comments. The patch still shows consistent perf uplift in proprietary benchmark on shared_ptr.
@EricWF @sebpop @hiraditya Any thoughts?
libcxx/include/memory | ||
---|---|---|
3702 ↗ | (On Diff #82962) | I don't think this should be named __libcpp_atomic_increment, because it uses relaxed ordering and thus it's not a generic increment (same goes for decrement). Could we rename this to __libcpp_atomic_refcount_increment or something similar? That would suggest why we're using acq+rel on one side and relaxed on other side. Using these functions for non-refcount purposes will be wrong and the current names (__libcpp_atomic_increment) suggest that they're doing generic atomic operations. |
Adresses comments from @kubabrecka: minor changes on function names. Rename __libcpp_atomic_* to __libcpp_atomic_refcount_*.
Almost LGTM. Just a couple of inline comments left. Thanks for working on this!
libcxx/include/memory | ||
---|---|---|
3691 ↗ | (On Diff #84004) |
|
3701 ↗ | (On Diff #84004) | inline _LIBCPP_INLINE_VISIBILITY |
3759 ↗ | (On Diff #84004) | _LIBCPP_FUNC_VIS goes before the return type. |
3797 ↗ | (On Diff #84004) | _LIBCPP_FUNC_VIS goes before the return type. |
Addresses comments from @EricWF.
Thanks for reviewing, I know it takes a lot of energy. It helped me learn a lot.
libcxx/include/memory | ||
---|---|---|
3700 ↗ | (On Diff #84565) | template <class _Tp>, please. Otherwise when some client code does #define T true (yes, I've seen that!) this breaks. _Tp is a reserved identifier, and if they use that, we can point at them and laugh. |
3702 ↗ | (On Diff #84565) | The parameter name needs to be reserved as well. __t, please. |
3711 ↗ | (On Diff #84565) | Same comment as L3700 |
3713 ↗ | (On Diff #84565) | Same comment as L3702 |