This is an archive of the discontinued LLVM Phabricator instance.

[libc++] Introduce __make_uninitialized_buffer and use it instead of get_temporary_buffer
AcceptedPublic

Authored by philnik on Jun 5 2023, 4:36 PM.

Details

Summary

This will also be used in some PSTL backends.

Diff Detail

Event Timeline

philnik created this revision.Jun 5 2023, 4:36 PM
Herald added a project: Restricted Project. · View Herald TranscriptJun 5 2023, 4:36 PM
philnik requested review of this revision.Jun 5 2023, 4:36 PM
Herald added a project: Restricted Project. · View Herald Transcript
Herald added a reviewer: Restricted Project. · View Herald Transcript
ldionne added inline comments.Jun 6 2023, 8:31 AM
libcxx/include/__algorithm/stable_partition.h
295

I think __return_temporary_buffer is not needed anymore.

libcxx/include/__memory/uninitialized_buffer.h
2

Per our discussion just now, this might be worth replacing by:

template <class _Destructor>
struct __uninitialized_buffer_deleter {
    size_t __count_;
    _Destructor __dtor_;

    template <class _Tp>
    void operator()(_Tp* __ptr) const {
        __dtor_(__ptr, __count_);
#ifdef _LIBCPP_HAS_NO_ALIGNED_ALLOCATION
        ::operator delete(__ptr, __count_ * sizeof(_Tp));
#else
        ::operator delete(__ptr, __count_ * sizeof(_Tp), align_val_t(_LIBCPP_ALIGNOF(_Tp)));
#endif
    }
};

template <class _Array, class _Destructor>
unique_ptr<_Array, __uninitialized_buffer_deleter<_Destructor> > __get_uninitialized_buffer(size_t __n, _Destructor __destroy) {
    static_assert(is_array_v<_Array>);
    using _Tp = remove_extent_t<_Array>;
#ifdef _LIBCPP_HAS_NO_ALIGNED_ALLOCATION
    _Tp* __ptr = static_cast<_Tp*>(::operator new(sizeof(_Tp) * __count));
#else
    _Tp* __ptr = static_cast<_Tp*>(::operator new(sizeof(_Tp) * __count, align_val_t(_LIBCPP_ALIGNOF(_Tp))));
#endif

    using _Deleter = __uninitialized_buffer_deleter<_Destructor>;
    _Deleter __deleter{__n, __destroy};
    return unique_ptr<_Array, _Deleter>(__ptr, __deleter);
}

Would it make sense to add some internal tests for this class?

libcxx/include/__algorithm/stable_partition.h
19

Can this be removed?

142–143

Is never correct when the type does things in its destructor?
Or are all elements in a moved from state?

libcxx/include/__memory/construct_at.h
99
libcxx/include/__memory/uninitialized_buffer.h
35
philnik updated this revision to Diff 528979.Jun 6 2023, 12:05 PM
philnik marked 6 inline comments as done.

Address comments

Would it make sense to add some internal tests for this class?

Since I changed this to using a unqiue_ptr I don't think it makes a lot of sense to add any internal tests.

libcxx/include/__algorithm/stable_partition.h
142–143

All the elements get destructed by __stable_partition_impl, so the buffer doesn't have to do anything other than deleting the allocation.

philnik retitled this revision from [libc++] Introduce __uninitialized_buffer and use it instead of get_temporary_buffer to [libc++] Introduce __make_uninitialized_buffer and use it instead of get_temporary_buffer.Jun 6 2023, 2:28 PM
philnik updated this revision to Diff 529048.Jun 6 2023, 2:34 PM

Try to fix CI

philnik updated this revision to Diff 529102.Jun 6 2023, 5:13 PM

Try to fix CI

ldionne accepted this revision.Jun 7 2023, 8:29 AM
ldionne added inline comments.
libcxx/include/__memory/uninitialized_buffer.h
27

Can you please document what this class does. In particular, please document how the destructor function gets called (and include what the count represents -- bytes or number of elements). This is pretty obvious if you stop to think about it, but it should be documented.

libcxx/test/std/algorithms/alg.modifying.operations/alg.partitions/stable_partition.not_enough_memory.pass.cpp
35 ↗(On Diff #529102)

You should check that the range has been partitioned.

And actually maybe this should be in the general test for stable_partition? Same comments below.

This revision is now accepted and ready to land.Jun 7 2023, 8:29 AM
philnik updated this revision to Diff 529382.Jun 7 2023, 11:18 AM
philnik marked 2 inline comments as done.

Address comments

philnik updated this revision to Diff 529632.Jun 8 2023, 9:26 AM

Try to fix CI

philnik updated this revision to Diff 529665.Jun 8 2023, 10:21 AM

Fix formatting

Mordante accepted this revision.Jun 8 2023, 11:37 AM

LGTM!

philnik updated this revision to Diff 529701.Jun 8 2023, 1:17 PM

Try to fix CI

philnik updated this revision to Diff 529781.Jun 8 2023, 4:47 PM

Try to fix CI

philnik updated this revision to Diff 529992.Jun 9 2023, 9:37 AM

Try to fix CI

philnik updated this revision to Diff 530105.Jun 9 2023, 3:44 PM

Try to fix CI

philnik updated this revision to Diff 530519.Jun 12 2023, 8:36 AM

Try to fix CI

philnik updated this revision to Diff 530909.Jun 13 2023, 7:53 AM

Try to fix CI

philnik updated this revision to Diff 530996.Jun 13 2023, 11:08 AM

assert on invalid params

ldionne added inline comments.Jun 13 2023, 3:50 PM
libcxx/test/std/algorithms/alg.sorting/alg.sort/stable.sort/stable_sort.pass.cpp
159–165

Here and everywhere -- you used 2-space indentation instead of 4 spaces as in the existing file.

philnik updated this revision to Diff 531388.Jun 14 2023, 9:38 AM

Try adding a nothrow version

philnik updated this revision to Diff 531404.Jun 14 2023, 10:00 AM

Try to fix CI

philnik updated this revision to Diff 531920.Jun 15 2023, 3:24 PM

Try to fix CI

hans added a subscriber: hans.Jun 26 2023, 5:43 AM

We're hitting CFI errors in Chromium after this, see https://github.com/llvm/llvm-project/issues/63523

mingmingl added inline comments.
libcxx/include/__memory/uninitialized_buffer.h
72–74

Hi, a regression is observed in a benchmark that uses std::stable_sort intensively.

Comparing baseline and the regressed case, the baseline (using get_temporary_buffer) uses unaligned alloc ( void* operator new[](size_t size) ) when objects are not over aligned, and the regressed uses aligned alloc for all objects. This may increase the memory usage as well.

Given std::stable_sort is pretty important, I wonder if this regression warrants a fix to have consistent usage of (aligned or unaligned) new operator?

philnik added inline comments.Jun 26 2023, 9:12 AM
libcxx/include/__memory/uninitialized_buffer.h
72–74

I don't understand how using the aligned alloc would increase memory usage. void* operator new[](size_t size) is basically void* operator new[](size_t size, align_val_t{__STDCPP_DEFAULT_NEW_ALIGNMENT__}). So unless you have a bad implementation it shouldn't make any difference.

mingmingl added inline comments.Jun 26 2023, 10:55 AM
libcxx/include/__memory/uninitialized_buffer.h
72–74

thanks for the quick response!

To share more context, the malloc implementation is tcmalloc (https://github.com/google/tcmalloc); and aligned alloc exercises more instructions than unaligned one. Considering the affected library is stable_sort, the benefit of keeping unaligned alloc is not small.

Comparing the llvm-objdump output below, aligned operator new has a longer instruction sequence (0x154 for regressed case, 0xec for baseline) -> this increases the number of instructions executed on a hot path (std::stable_sort).

000000000b4702c0 g     F section-name	00000000000000ec tcmalloc_size_returning_operator_new   # new operator used in BASELINE
000000000b4703c0 g     F section-name	0000000000000154 tcmalloc_size_returning_operator_new_aligned
000000000b470540 g     F section-name	00000000000001ec tcmalloc_size_returning_operator_new_hot_cold
000000000b470740 g     F section-name	00000000000002ac tcmalloc_size_returning_operator_new_aligned_hot_cold
000000000b470a00 g     F section-name 00000000000000ec tcmalloc_size_returning_operator_new_nothrow
000000000b470b00 g     F section-name	0000000000000154 tcmalloc_size_returning_operator_new_aligned_nothrow # new operator in EXPERIMENT
000000000b470c80 g     F section-name	00000000000001ec tcmalloc_size_returning_operator_new_hot_cold_nothrow
000000000b470e80 g     F section-name	00000000000002ac tcmalloc_size_returning_operator_new_aligned_hot_cold_nothrow
philnik added inline comments.Jun 26 2023, 11:07 AM
libcxx/include/__memory/uninitialized_buffer.h
72–74

How exactly does that correlate to memory use? I have no idea how TCMalloc is implemented, so your links don't really help me. It could just be the case that the aligned path isn't as well optimized as the non-aligned one, resulting in slightly worse performance. I don't see any theoretical reason the aligned path should be any slower than the non-aligned one. Worst-case should be a single condition to see whether the non-aligned path should be taken if it's actually faster.

mingmingl added inline comments.Jun 26 2023, 11:50 AM
libcxx/include/__memory/uninitialized_buffer.h
72–74

How exactly does that correlate to memory use?

I consulted tcmalloc experts and got the confirmation that this won't increase memory usage; however, using aligned alloc where it's not necessary is going to slow down the execution. I'm not very familiar with the implementation either and misread how alignment affects the size class calculation; sorry about the wrong statement on memory increase.

In the regressed case, aligned operator new does extra work to achieve the same results. To put it the other way, the unaligned operator new can use a compile time constant in its calculations, and saves a runtime cost to choose a size class.

  • This 10-line code snippet and the function should hopefully roughly explain where the additional runtime cost comes from when aligned alloc is used (without a full implementation detail)

Worst-case should be a single condition to see whether the non-aligned path should be taken if it's actually faster.

At a large scale, a single extra if statement here adds up across multiple binaries since stable_sort is used widely.

Could the original unaligned new operator be added for parity?

hans added a comment.Jun 27 2023, 6:50 AM

We're hitting CFI errors in Chromium after this, see https://github.com/llvm/llvm-project/issues/63523

Friendly ping to make sure this doesn't get lost in the other discussion.

Should we partially revert this change? Instead of making use of the new interface for stable_sort, the old get_temporary_buffer can be used (as an internal API), until the performance parity is reached?

We're hitting CFI errors in Chromium after this, see https://github.com/llvm/llvm-project/issues/63523

Friendly ping to make sure this doesn't get lost in the other discussion.

I'm working on adding a CFI run to our CI right now. It looks like I just have to add _LIBCPP_NO_CFI to __make_uninitialized_buffer, given that get_temporary_buffer also has that attribute. I hope to get the patch completed by tomorrow.

Should we partially revert this change? Instead of making use of the new interface for stable_sort, the old get_temporary_buffer can be used (as an internal API), until the performance parity is reached?

I don't see why. If the allocation is actually so costly that it dominates stable_sort, you should either look into whether you actually need stable_sort for your workload or look into getting a better allocator (TCMalloc is probably not bad, so I suspect that's not the problem). Sorting doesn't seem like a task where a single allocation should ever dominate. I'd also like to see the actual use case where a single branch is so problematic that it warrants a revert before doing so. I don't even know how much of an impact this patch actually is in your specific use-case.

We're hitting CFI errors in Chromium after this, see https://github.com/llvm/llvm-project/issues/63523

Friendly ping to make sure this doesn't get lost in the other discussion.

I'm working on adding a CFI run to our CI right now. It looks like I just have to add _LIBCPP_NO_CFI to __make_uninitialized_buffer, given that get_temporary_buffer also has that attribute. I hope to get the patch completed by tomorrow.

Should we partially revert this change? Instead of making use of the new interface for stable_sort, the old get_temporary_buffer can be used (as an internal API), until the performance parity is reached?

I don't see why. If the allocation is actually so costly that it dominates stable_sort, you should either look into whether you actually need stable_sort for your workload or look into getting a better allocator (TCMalloc is probably not bad, so I suspect that's not the problem). Sorting doesn't seem like a task where a single allocation should ever dominate. I'd also like to see the actual use case where a single branch is so problematic that it warrants a revert before doing so. I don't even know how much of an impact this patch actually is in your specific use-case.

My understanding is that problem is caused by the more aggressive use of aligned allocation with this patch (not due to increased memory usage). The old interface has special code (additional branch) to check the overalignment and specialize based on that. Note those check are compile time constant and will be optimized away. In that regard, the new implementation misses out the optimization (specialization) so I believe it is a regression.

The problem happens to be caught by using tcmalloc because it has a very well tuned code path for unaligned case.

hans added a comment.Jun 28 2023, 4:39 AM

I'm working on adding a CFI run to our CI right now. It looks like I just have to add _LIBCPP_NO_CFI to __make_uninitialized_buffer, given that get_temporary_buffer also has that attribute. I hope to get the patch completed by tomorrow.

I see the patch landed as https://github.com/llvm/llvm-project/commit/420a204d52205f1277a8d5df3dbafac6082e02e2 thanks!

EricWF added a subscriber: EricWF.Jun 28 2023, 3:20 PM

@philnik

First, I appreciate the energy you're putting into this project. Your work doesn't go unnoticed.

As for this std::stable_sort topic, I get your position. Ideally, the performance difference between these
two approaches wouldn't be worth discussing. But reality often diverges from theory.

The folks who raised concerns are experiencing substantial performance impacts.
They've presented data to support this - the kind of proactive involvement we strive to cultivate in our community.

However, some of the discourse seems a bit hasty.
Instead of outright dismissing these concerns, let's take a step back and understand the nuances.
If clarity is missing, ask for more details.

It's also worth mentioning that referring to someone's implementation as "bad" may not be the best choice of words.
A more understanding tone can make a difference in promoting a collaborative and respectful community.

Your dedication to this project is appreciated. Let's keep improving it together.

We're hitting CFI errors in Chromium after this, see https://github.com/llvm/llvm-project/issues/63523

Friendly ping to make sure this doesn't get lost in the other discussion.

I'm working on adding a CFI run to our CI right now. It looks like I just have to add _LIBCPP_NO_CFI to __make_uninitialized_buffer, given that get_temporary_buffer also has that attribute. I hope to get the patch completed by tomorrow.

Should we partially revert this change? Instead of making use of the new interface for stable_sort, the old get_temporary_buffer can be used (as an internal API), until the performance parity is reached?

I don't see why. If the allocation is actually so costly that it dominates stable_sort, you should either look into whether you actually need stable_sort for your workload

Forgot to answer the common on the use of stable_sort:

The stable sort is called by another common library not controlled directly by the user. Besides the behavior depends on the input. The problem will manifest when the array to be sorted is small.

or look into getting a better allocator (TCMalloc is probably not bad, so I suspect that's not the problem). Sorting doesn't seem like a task where a single allocation should ever dominate. I'd also like to see the actual use case where a single branch is so problematic that it warrants a revert before doing so. I don't even know how much of an impact this patch actually is in your specific use-case.

@philnik

First, I appreciate the energy you're putting into this project. Your work doesn't go unnoticed.

As for this std::stable_sort topic, I get your position. Ideally, the performance difference between these
two approaches wouldn't be worth discussing. But reality often diverges from theory.

The folks who raised concerns are experiencing substantial performance impacts.

Maybe I've missed it, but I didn't see any data other than "performance got worse".

They've presented data to support this - the kind of proactive involvement we strive to cultivate in our community.

However, some of the discourse seems a bit hasty.
Instead of outright dismissing these concerns, let's take a step back and understand the nuances.

My intention was never to dismiss the concern, but I'd like a specific use-case so I can understand what the problem is.

If clarity is missing, ask for more details.

It's also worth mentioning that referring to someone's implementation as "bad" may not be the best choice of words.
A more understanding tone can make a difference in promoting a collaborative and respectful community.

Your dedication to this project is appreciated. Let's keep improving it together.

Yeah, that might not have been the best choice of words.

Forgot to answer the common on the use of stable_sort:

The stable sort is called by another common library not controlled directly by the user. Besides the behavior depends on the input. The problem will manifest when the array to be sorted is small.

Maybe the better approach would be to improve our __stable_sort_switch heuristic. Would it be possible for you to share the problematic use-case? I suspect you might be using a non-trivially-copy-assignable type which isn't that costly to copy, resulting in small allocations that don't make any sense.

@philnik I rerouted the allocations for temporary buffer to go through the __libcpp_allocate infrastructure. It should address the issue. Please take a look at D154017

@philnik

First, I appreciate the energy you're putting into this project. Your work doesn't go unnoticed.

As for this std::stable_sort topic, I get your position. Ideally, the performance difference between these
two approaches wouldn't be worth discussing. But reality often diverges from theory.

The folks who raised concerns are experiencing substantial performance impacts.

Maybe I've missed it, but I didn't see any data other than "performance got worse".

They've presented data to support this - the kind of proactive involvement we strive to cultivate in our community.

However, some of the discourse seems a bit hasty.
Instead of outright dismissing these concerns, let's take a step back and understand the nuances.

My intention was never to dismiss the concern, but I'd like a specific use-case so I can understand what the problem is.

If clarity is missing, ask for more details.

It's also worth mentioning that referring to someone's implementation as "bad" may not be the best choice of words.
A more understanding tone can make a difference in promoting a collaborative and respectful community.

Your dedication to this project is appreciated. Let's keep improving it together.

Yeah, that might not have been the best choice of words.

Forgot to answer the common on the use of stable_sort:

The stable sort is called by another common library not controlled directly by the user. Besides the behavior depends on the input. The problem will manifest when the array to be sorted is small.

Maybe the better approach would be to improve our __stable_sort_switch heuristic. Would it be possible for you to share the problematic use-case? I suspect you might be using a non-trivially-copy-assignable type which isn't that costly to copy, resulting in small allocations that don't make any sense.

This patch affects multiple benchmarks. I checked one of them again, and it turned out the new calls to the aligned new are from inplace_merge which is also affected in this revision. Coincidentally, the stable_sort and inplace_merge are called back-to-back in the caller.

IMO I agree that if we were using aligned allocation and we're not anymore, this is simply something that went unnoticed in this patch and we should fix it.

Should stable_sort be dominated by an allocation? Probably not, that seems a bit crazy. But nonetheless it seems like we did introduce an unintended difference in this patch and it should be pretty easy to fix. Let's follow up on D152208.

ldionne added a comment.EditedJun 29 2023, 8:16 AM

IMO I agree that if we were using aligned allocation and we're not anymore, this is simply something that went unnoticed in this patch and we should fix it.

Should stable_sort be dominated by an allocation? Probably not, that seems a bit crazy. But nonetheless it seems like we did introduce an unintended difference in this patch and it should be pretty easy to fix. Let's follow up on D152208.

Looks like I read this issue the wrong way around. So we were doing unaligned allocation before, and now we're using aligned allocation (which should be more correct, no?) and it's slower. That's a lot more curious, and I think it's worth investigating why this caused a regression because there might be some sort of perf issue in how your allocator deals with aligned allocation (or maybe on the libc++ side). Anyway, let's follow-up in D152208 to avoid forking the discussion too much.

IMO I agree that if we were using aligned allocation and we're not anymore, this is simply something that went unnoticed in this patch and we should fix it.

Should stable_sort be dominated by an allocation? Probably not, that seems a bit crazy. But nonetheless it seems like we did introduce an unintended difference in this patch and it should be pretty easy to fix. Let's follow up on D152208.

Looks like I read this issue the wrong way around. So we were doing unaligned allocation before, and now we're using aligned allocation (which should be more correct, no?) and it's slower. That's a lot more curious, and I think it's worth investigating why this caused a regression because there might be some sort of perf issue in how your allocator deals with aligned allocation (or maybe on the libc++ side). Anyway, let's follow-up in D152208 to avoid forking the discussion too much.

As I mentioned in the previous reply, the performance regression comes from addition runtime cost (allocation) in the context of inplace_merge call. A little more debugging shows that in the old version, the allocation can be completely elided when buf_size is 0 (see get_temporary_buffer code), but make_uninitialized_buffer will try to invoke operator new regardless.

// inplace_merge code

difference_type __len1 = _IterOps<_AlgPolicy>::distance(__first, __middle);
difference_type __len2 = _IterOps<_AlgPolicy>::distance(__middle, __last);
difference_type __buf_size = _VSTD::min(__len1, __len2);

As I mentioned in the previous reply, the performance regression comes from addition runtime cost (allocation) in the context of inplace_merge call. A little more debugging shows that in the old version, the allocation can be completely elided when buf_size is 0 (see get_temporary_buffer code), but make_uninitialized_buffer will try to invoke operator new regardless.

// inplace_merge code

difference_type __len1 = _IterOps<_AlgPolicy>::distance(__first, __middle);
difference_type __len2 = _IterOps<_AlgPolicy>::distance(__middle, __last);
difference_type __buf_size = _VSTD::min(__len1, __len2);

Ok, so this whole thing has nothing to do with the fact that we are using aligned allocation vs not using aligned allocation. It has to do with the fact that we're allocating, period.

I'm going to revert this patch for now, because @EricWF pointed out that we actually had an issue with the class itself -- if constructing any of the elements in the buffer throws, the destruction is not going to happen properly (we'll over-destroy). This wasn't a problem with the use case that this was meant for (PSTL), since we std::terminate() if the construction of any element would throw. But I agree that is a poor general-purpose API to have. We can take another stab at this next week.

As I mentioned in the previous reply, the performance regression comes from addition runtime cost (allocation) in the context of inplace_merge call. A little more debugging shows that in the old version, the allocation can be completely elided when buf_size is 0 (see get_temporary_buffer code), but make_uninitialized_buffer will try to invoke operator new regardless.

// inplace_merge code

difference_type __len1 = _IterOps<_AlgPolicy>::distance(__first, __middle);
difference_type __len2 = _IterOps<_AlgPolicy>::distance(__middle, __last);
difference_type __buf_size = _VSTD::min(__len1, __len2);

Ok, so this whole thing has nothing to do with the fact that we are using aligned allocation vs not using aligned allocation. It has to do with the fact that we're allocating, period.

Correct.

I'm going to revert this patch for now, because @EricWF pointed out that we actually had an issue with the class itself -- if constructing any of the elements in the buffer throws, the destruction is not going to happen properly (we'll over-destroy). This wasn't a problem with the use case that this was meant for (PSTL), since we std::terminate() if the construction of any element would throw. But I agree that is a poor general-purpose API to have. We can take another stab at this next week.

IMO I agree that if we were using aligned allocation and we're not anymore, this is simply something that went unnoticed in this patch and we should fix it.

Should stable_sort be dominated by an allocation? Probably not, that seems a bit crazy. But nonetheless it seems like we did introduce an unintended difference in this patch and it should be pretty easy to fix. Let's follow up on D152208.

Looks like I read this issue the wrong way around. So we were doing unaligned allocation before, and now we're using aligned allocation (which should be more correct, no?) and it's slower. That's a lot more curious, and I think it's worth investigating why this caused a regression because there might be some sort of perf issue in how your allocator deals with aligned allocation (or maybe on the libc++ side). Anyway, let's follow-up in D152208 to avoid forking the discussion too much.

As I mentioned in the previous reply, the performance regression comes from addition runtime cost (allocation) in the context of inplace_merge call. A little more debugging shows that in the old version, the allocation can be completely elided when buf_size is 0 (see get_temporary_buffer code), but make_uninitialized_buffer will try to invoke operator new regardless.

// inplace_merge code

difference_type __len1 = _IterOps<_AlgPolicy>::distance(__first, __middle);
difference_type __len2 = _IterOps<_AlgPolicy>::distance(__middle, __last);
difference_type __buf_size = _VSTD::min(__len1, __len2);

thanks for debugging this! I wish I continued to get a detailed call-graph when seeing the common caller instruction cycle increases rather than guessing it's stable_sort :(

philnik reopened this revision.Jul 3 2023, 11:04 AM
This revision is now accepted and ready to land.Jul 3 2023, 11:04 AM