This will also be used in some PSTL backends.
Details
- Reviewers: ldionne, jdoerfert, Mordante
- Commits: rG31eeba3f7c0e: [libc++] Introduce __make_uninitialized_buffer and use it instead of…

Diff Detail
- Repository: rG LLVM Github Monorepo

Event Timeline
libcxx/include/__algorithm/stable_partition.h
295: I think __return_temporary_buffer is not needed anymore.
libcxx/include/__memory/uninitialized_buffer.h
2: Per our discussion just now, this might be worth replacing by:

    template <class _Destructor>
    struct __uninitialized_buffer_deleter {
      size_t __count_;
      _Destructor __dtor_;

      template <class _Tp>
      void operator()(_Tp* __ptr) const {
        __dtor_(__ptr, __count_);
    #ifdef _LIBCPP_HAS_NO_ALIGNED_ALLOCATION
        ::operator delete(__ptr, __count_ * sizeof(_Tp));
    #else
        ::operator delete(__ptr, __count_ * sizeof(_Tp), align_val_t(_LIBCPP_ALIGNOF(_Tp)));
    #endif
      }
    };

    template <class _Array, class _Destructor>
    unique_ptr<_Array, __uninitialized_buffer_deleter<_Destructor> >
    __get_uninitialized_buffer(size_t __n, _Destructor __destroy) {
      static_assert(is_array_v<_Array>);
      using _Tp = remove_extent_t<_Array>;
    #ifdef _LIBCPP_HAS_NO_ALIGNED_ALLOCATION
      _Tp* __ptr = static_cast<_Tp*>(::operator new(sizeof(_Tp) * __n));
    #else
      _Tp* __ptr = static_cast<_Tp*>(::operator new(sizeof(_Tp) * __n, align_val_t(_LIBCPP_ALIGNOF(_Tp))));
    #endif
      using _Deleter = __uninitialized_buffer_deleter<_Destructor>;
      _Deleter __deleter{__n, __destroy};
      return unique_ptr<_Array, _Deleter>(__ptr, __deleter);
    }
Would it make sense to add some internal tests for this class?
libcxx/include/__algorithm/stable_partition.h
19: Can this be removed?
142–143: Is never correct when the type does things in its destructor?
Since I changed this to use a unique_ptr, I don't think it makes a lot of sense to add any internal tests.
libcxx/include/__algorithm/stable_partition.h
142–143: All the elements get destructed by __stable_partition_impl, so the buffer doesn't have to do anything other than deleting the allocation.
libcxx/include/__memory/uninitialized_buffer.h
27: Can you please document what this class does. In particular, please document how the destructor function gets called (and include what the count represents -- bytes or number of elements). This is pretty obvious if you stop to think about it, but it should be documented.
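For illustration, a hedged sketch of the kind of documentation being asked for here, with the semantics inferred from the snippet suggested earlier in this review (wording and names assumed, not the final patch text):

    // __make_uninitialized_buffer(__n, __destroy) allocates uninitialized storage
    // for __n objects of _Tp and returns it owned by a unique_ptr. When that
    // unique_ptr is destroyed, the deleter first calls __destroy(__ptr, __n),
    // where __n is the number of *elements* (not bytes); the callback is expected
    // to end the lifetime of whatever objects the caller constructed in the
    // buffer. The deleter then deallocates the storage with the matching
    // operator delete.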
libcxx/test/std/algorithms/alg.modifying.operations/alg.partitions/stable_partition.not_enough_memory.pass.cpp
35 (On Diff #529102): You should check that the range has been partitioned. And actually maybe this should be in the general test for stable_partition? Same comments below.
libcxx/test/std/algorithms/alg.sorting/alg.sort/stable.sort/stable_sort.pass.cpp
159–165: Here and everywhere -- you used 2-space indentation instead of 4 spaces as in the existing file.
We're hitting CFI errors in Chromium after this, see https://github.com/llvm/llvm-project/issues/63523
libcxx/include/__memory/uninitialized_buffer.h
72–74: Hi, a regression is observed in a benchmark that uses std::stable_sort intensively. Comparing the baseline and the regressed case, the baseline (using get_temporary_buffer) uses the unaligned alloc (void* operator new[](size_t size)) when objects are not overaligned, while the regressed case uses the aligned alloc for all objects. This may increase memory usage as well. Given that std::stable_sort is pretty important, I wonder if this regression warrants a fix to have consistent usage of the (aligned or unaligned) new operator?
libcxx/include/__memory/uninitialized_buffer.h
72–74: I don't understand how using the aligned alloc would increase memory usage. void* operator new[](size_t size) is basically void* operator new[](size_t size, align_val_t{__STDCPP_DEFAULT_NEW_ALIGNMENT__}). So unless you have a bad implementation it shouldn't make any difference.
libcxx/include/__memory/uninitialized_buffer.h
72–74: Thanks for the quick response! To share more context, the malloc implementation is tcmalloc (https://github.com/google/tcmalloc), and the aligned alloc exercises more instructions than the unaligned one. Considering that the affected algorithm is std::stable_sort, the benefit of keeping the unaligned alloc is not small.
Comparing the llvm-objdump output below, the aligned operator new has a longer instruction sequence (0x154 in the regressed case vs. 0xec in the baseline), which increases the number of instructions executed on a hot path (std::stable_sort).

    000000000b4702c0 g F section-name 00000000000000ec tcmalloc_size_returning_operator_new                 # new operator used in BASELINE
    000000000b4703c0 g F section-name 0000000000000154 tcmalloc_size_returning_operator_new_aligned
    000000000b470540 g F section-name 00000000000001ec tcmalloc_size_returning_operator_new_hot_cold
    000000000b470740 g F section-name 00000000000002ac tcmalloc_size_returning_operator_new_aligned_hot_cold
    000000000b470a00 g F section-name 00000000000000ec tcmalloc_size_returning_operator_new_nothrow
    000000000b470b00 g F section-name 0000000000000154 tcmalloc_size_returning_operator_new_aligned_nothrow # new operator in EXPERIMENT
    000000000b470c80 g F section-name 00000000000001ec tcmalloc_size_returning_operator_new_hot_cold_nothrow
    000000000b470e80 g F section-name 00000000000002ac tcmalloc_size_returning_operator_new_aligned_hot_cold_nothrow
libcxx/include/__memory/uninitialized_buffer.h
72–74: How exactly does that correlate to memory use? I have no idea how TCMalloc is implemented, so your links don't really help me. It could just be the case that the aligned path isn't as well optimized as the non-aligned one, resulting in slightly worse performance. I don't see any theoretical reason the aligned path should be any slower than the non-aligned one. Worst-case should be a single condition to see whether the non-aligned path should be taken if it's actually faster.
libcxx/include/__memory/uninitialized_buffer.h
72–74: I consulted tcmalloc experts and got confirmation that this won't increase memory usage; however, using aligned alloc where it's not necessary is going to slow down the execution. I'm not very familiar with the implementation either and misread how alignment affects the size-class calculation; sorry about the wrong statement on the memory increase. In the regressed case, the aligned operator new does extra work to achieve the same results. To put it the other way, the unaligned operator new can use a compile-time constant in its calculations, which saves the runtime cost of choosing a size class.
At a large scale, a single extra if statement here adds up across multiple binaries since stable_sort is used widely. Could the original unaligned new operator be added back for parity?
Should we partially revert this change? Instead of using the new interface for stable_sort, the old get_temporary_buffer could be used (as an internal API) until performance parity is reached.
I'm working on adding a CFI run to our CI right now. It looks like I just have to add _LIBCPP_NO_CFI to __make_uninitialized_buffer, given that get_temporary_buffer also has that attribute. I hope to get the patch completed by tomorrow.
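For reference, a hedged sketch of what that attribute change could look like; the declaration below is assumed for illustration only (the return type and parameters are borrowed from the deleter snippet earlier in this review), not the actual signature in uninitialized_buffer.h:

    // Hypothetical declaration -- only the attribute placement matters here.
    // _LIBCPP_NO_CFI suppresses control-flow-integrity checks inside the function,
    // matching the attribute that get_temporary_buffer already carries.
    template <class _Tp, class _Destructor>
    _LIBCPP_NO_CFI _LIBCPP_HIDE_FROM_ABI
    unique_ptr<_Tp[], __uninitialized_buffer_deleter<_Destructor> >
    __make_uninitialized_buffer(size_t __n, _Destructor __destroy);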
I don't see why. If the allocation is actually so costly that it dominates stable_sort, you should either look into whether you actually need stable_sort for your workload or look into getting a better allocator (TCMalloc is probably not bad, so I suspect that's not the problem). Sorting doesn't seem like a task where a single allocation should ever dominate. I'd also like to see the actual use case where a single branch is so problematic that it warrants a revert before doing so. I don't even know how much of an impact this patch actually has in your specific use case.
My understanding is that the problem is caused by the more aggressive use of aligned allocation with this patch (not by increased memory usage). The old interface has special code (an additional branch) to check for overalignment and specialize based on that. Note that those checks are compile-time constants and will be optimized away. In that regard, the new implementation misses out on that optimization (specialization), so I believe it is a regression.
The problem happens to be caught by using tcmalloc because it has a very well tuned code path for the unaligned case.
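A minimal standalone sketch of the compile-time specialization being described (not the actual get_temporary_buffer code; the function name is hypothetical): the condition depends only on the element type, so only one branch survives codegen, and ordinarily-aligned types keep using the unaligned operator new.

    #include <cstddef>
    #include <new>

    // Sketch: pick the unaligned operator new whenever the default new alignment
    // already satisfies _Tp; the check is a compile-time constant, so the aligned
    // branch is removed entirely for ordinarily-aligned element types.
    template <class _Tp>
    _Tp* allocate_buffer(std::size_t __n) {
      if constexpr (alignof(_Tp) > __STDCPP_DEFAULT_NEW_ALIGNMENT__)
        return static_cast<_Tp*>(::operator new(__n * sizeof(_Tp), std::align_val_t(alignof(_Tp))));
      else
        return static_cast<_Tp*>(::operator new(__n * sizeof(_Tp)));
    }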
I see the patch landed as https://github.com/llvm/llvm-project/commit/420a204d52205f1277a8d5df3dbafac6082e02e2 thanks!
First, I appreciate the energy you're putting into this project. Your work doesn't go unnoticed.
As for this std::stable_sort topic, I get your position. Ideally, the performance difference between these
two approaches wouldn't be worth discussing. But reality often diverges from theory.
The folks who raised concerns are experiencing substantial performance impacts.
They've presented data to support this - the kind of proactive involvement we strive to cultivate in our community.
However, some of the discourse seems a bit hasty.
Instead of outright dismissing these concerns, let's take a step back and understand the nuances.
If clarity is missing, ask for more details.
It's also worth mentioning that referring to someone's implementation as "bad" may not be the best choice of words.
A more understanding tone can make a difference in promoting a collaborative and respectful community.
Your dedication to this project is appreciated. Let's keep improving it together.
Forgot to answer the comment on the use of stable_sort:
The stable sort is called by another common library not controlled directly by the user. Besides, the behavior depends on the input. The problem manifests when the array to be sorted is small.
or look into getting a better allocator (TCMalloc is probably not bad, so I suspect that's not the problem). Sorting doesn't seem like a task where a single allocation should ever dominate. I'd also like to see the actual use case where a single branch is so problematic that it warrants a revert before doing so. I don't even know how much of an impact this patch actually has in your specific use case.
Maybe I've missed it, but I didn't see any data other than "performance got worse".
They've presented data to support this - the kind of proactive involvement we strive to cultivate in our community.
However, some of the discourse seems a bit hasty.
Instead of outright dismissing these concerns, let's take a step back and understand the nuances.
My intention was never to dismiss the concern, but I'd like a specific use-case so I can understand what the problem is.
If clarity is missing, ask for more details.
It's also worth mentioning that referring to someone's implementation as "bad" may not be the best choice of words.
A more understanding tone can make a difference in promoting a collaborative and respectful community. Your dedication to this project is appreciated. Let's keep improving it together.
Yeah, that might not have been the best choice of words.
Maybe the better approach would be to improve our __stable_sort_switch heuristic. Would it be possible for you to share the problematic use-case? I suspect you might be using a non-trivially-copy-assignable type which isn't that costly to copy, resulting in small allocations that don't make any sense.
This patch affects multiple benchmarks. I checked one of them again, and it turned out the new calls to the aligned new come from inplace_merge, which is also modified in this revision. Coincidentally, stable_sort and inplace_merge are called back-to-back in the caller.
IMO I agree that if we were using aligned allocation and we're not anymore, this is simply something that went unnoticed in this patch and we should fix it.
Should stable_sort be dominated by an allocation? Probably not, that seems a bit crazy. But nonetheless it seems like we did introduce an unintended difference in this patch and it should be pretty easy to fix. Let's follow up on D152208.
Looks like I read this issue the wrong way around. So we were doing unaligned allocation before, and now we're using aligned allocation (which should be more correct, no?) and it's slower. That's a lot more curious, and I think it's worth investigating why this caused a regression, because there might be some sort of perf issue in how your allocator deals with aligned allocation (or maybe on the libc++ side). Anyway, let's follow up in D152208 to avoid forking the discussion too much.
As I mentioned in the previous reply, the performance regression comes from additional runtime cost (an allocation) in the context of the inplace_merge call. A little more debugging shows that in the old version the allocation can be completely elided when buf_size is 0 (see the get_temporary_buffer code), but make_uninitialized_buffer will try to invoke operator new regardless.
    // inplace_merge code
    difference_type __len1 = _IterOps<_AlgPolicy>::distance(__first, __middle);
    difference_type __len2 = _IterOps<_AlgPolicy>::distance(__middle, __last);
    difference_type __buf_size = _VSTD::min(__len1, __len2);
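A minimal standalone sketch of the zero-size guard being described (the function is hypothetical, not the libc++ implementation): the old get_temporary_buffer returned a null buffer for a zero-sized request without touching operator new, and the same guard would let the back-to-back stable_sort/inplace_merge pattern above avoid paying for an allocation when min(__len1, __len2) is 0.

    #include <cstddef>
    #include <new>

    // Sketch: skip the call to operator new entirely when no storage is needed,
    // mirroring the zero-size behavior of the old get_temporary_buffer.
    void* allocate_if_needed(std::size_t __n) {
      if (__n == 0)
        return nullptr;            // elide the allocation on the zero-size path
      return ::operator new(__n);  // only reached when storage is actually needed
    }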
Ok, so this whole thing has nothing to do with the fact that we are using aligned allocation vs not using aligned allocation. It has to do with the fact that we're allocating, period.
I'm going to revert this patch for now, because @EricWF pointed out that we actually had an issue with the class itself -- if constructing any of the elements in the buffer throws, the destruction is not going to happen properly (we'll over-destroy). This wasn't a problem with the use case that this was meant for (PSTL), since we std::terminate() if the construction of any element would throw. But I agree that is a poor general-purpose API to have. We can take another stab at this next week.
Correct.
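To illustrate the over-destroy hazard mentioned in the revert note above, a minimal standalone sketch (names hypothetical, not libc++ code): if a constructor throws partway through filling the buffer, only the already-constructed prefix may be destroyed, whereas a deleter that unconditionally destroys the full element count would also run destructors on the never-constructed tail.

    #include <cstddef>
    #include <new>

    // Sketch: construct __n default-constructed objects in raw storage, unwinding
    // only the prefix that was actually constructed if a constructor throws.
    template <class _Tp>
    void construct_all(_Tp* __ptr, std::size_t __n) {
      std::size_t __i = 0;
      try {
        for (; __i < __n; ++__i)
          ::new (static_cast<void*>(__ptr + __i)) _Tp();
      } catch (...) {
        while (__i > 0)
          (__ptr + --__i)->~_Tp(); // destroy only the constructed prefix
        throw;
      }
    }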
thanks for debugging this! I wish I had continued to collect a detailed call graph when I saw the common caller's instruction cycles increase, rather than guessing it was stable_sort :(