This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
libcxx/include/__algorithm/
-
include/
-
__algorithm/
6/7
sort.h

Differential D125958

[libc++] Use __libcpp_clz for a tighter __log2i function
ClosedPublic

Authored by hans on May 19 2022, 2:49 AM.

Download Raw Diff

Details

Reviewers

nilayvaish
ldionne
philnik

Group Reviewers

Restricted Project

Commits

rG865ad6bd2165: [libc++] Use __libcpp_clz for a tighter __log2i function

Summary

While looking at D113413 I noticed that __log2i could perhaps be improved to be both slightly smaller and faster.

I'm not familiar with how to run benchmarks for this if there are any, but https://godbolt.org/z/48KT3z5Tj looks like an improvement to me.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

hans created this revision.May 19 2022, 2:49 AM

Herald added a project: Restricted Project. · View Herald TranscriptMay 19 2022, 2:49 AM

hans requested review of this revision.May 19 2022, 2:49 AM

Herald added a project: Restricted Project. · View Herald TranscriptMay 19 2022, 2:49 AM

Herald added 1 blocking reviewer(s): Restricted Project. · View Herald Transcript

Herald added a subscriber: libcxx-commits. · View Herald Transcript

Harbormaster completed remote builds in B165286: Diff 430617.May 19 2022, 4:17 AM

ayzhao added a subscriber: ayzhao.May 19 2022, 9:46 AM

Can you take a look at the failed test?

libcxx/include/__algorithm/sort.h
506	I am wondering whether CHAR_BIT is preferred over using CHAR_BIT (available from climit header file). I don't know what the convention is for writing library code.
511–517	Instead of retaining this, should we just have a static_assert() on the size of the type being less or equal to that of unsigned long long?

This revision now requires changes to proceed.May 21 2022, 12:31 PM

The CI failue look unrelated, but I can't recall any problems with the AArch64 CI. Just rebase it and try again. I don't think there are any benchmarks, but it would be nice if you could provide some in libcxx/benchmarks. I'd also like to see benchmark results before approving.

libcxx/include/__algorithm/sort.h
506	I think `CHAR_BIT` is the preferred way. At least that's what I'm using.
511–517	We support `__int128`, so this needs to stay.

philnik retitled this revision from Use __libcpp_clz for a tighter __log2i function to [libc++] Use __libcpp_clz for a tighter __log2i function.May 21 2022, 12:40 PM

Here are some benchmark numbers (requires D126297). I ran build.release/projects/libcxx/benchmarks/sort.libcxx.out --benchmark_filter=BM_Sort_uint32_QuickSortAdversary*.

Before my patch:

-----------------------------------------------------------------------------------
Benchmark                                         Time             CPU   Iterations
-----------------------------------------------------------------------------------
BM_Sort_uint32_QuickSortAdversary_1            4.51 ns         4.51 ns    154664960
BM_Sort_uint32_QuickSortAdversary_4            1.97 ns         1.97 ns    357040128
BM_Sort_uint32_QuickSortAdversary_16          0.941 ns        0.941 ns    742129664
BM_Sort_uint32_QuickSortAdversary_64           11.0 ns         11.0 ns     63176704
BM_Sort_uint32_QuickSortAdversary_256          18.1 ns         18.1 ns     38535168
BM_Sort_uint32_QuickSortAdversary_1024         33.1 ns         33.1 ns     21233664
BM_Sort_uint32_QuickSortAdversary_16384        49.4 ns         49.4 ns     14155776
BM_Sort_uint32_QuickSortAdversary_262144       59.9 ns         59.9 ns     11796480

After my patch:

-----------------------------------------------------------------------------------
Benchmark                                         Time             CPU   Iterations
-----------------------------------------------------------------------------------
BM_Sort_uint32_QuickSortAdversary_1            4.81 ns         4.81 ns    145227776
BM_Sort_uint32_QuickSortAdversary_4            1.82 ns         1.82 ns    379322368
BM_Sort_uint32_QuickSortAdversary_16          0.849 ns        0.849 ns    820772864
BM_Sort_uint32_QuickSortAdversary_64           11.6 ns         11.6 ns     60293120
BM_Sort_uint32_QuickSortAdversary_256          19.5 ns         19.5 ns     36175872
BM_Sort_uint32_QuickSortAdversary_1024         36.0 ns         36.0 ns     19660800
BM_Sort_uint32_QuickSortAdversary_16384        53.2 ns         53.2 ns     13369344
BM_Sort_uint32_QuickSortAdversary_262144       63.7 ns         63.7 ns     11272192

Looks like some numbers are up and some are down. I'd say this is perf neutral.

But it does have a small effect on size. The text of sort.libcxx.out goes from 509310 to 508542 bytes, so 768 bytes saved.

I also tried this on a "release" build of Chrome (without ThinLTO or PGO, so doesn't exactly match what we ship), and the text went from 259288968 to 259282312 bytes, so 6.5 KB saved.

libcxx/include/__algorithm/sort.h
506	It seems both are used roughly as often. $ git grep -In [^_]CHAR_BIT -- libcxx/include \| wc -l 25 $ git grep -In __CHAR_BIT__ -- libcxx/include \| wc -l 22 Given that `__CHAR_BIT__` avoids potentially pushing another header include on every user of `<sort>`, I'd prefer to stay with it.

I'd also like to see a clean CI run.

libcxx/include/__algorithm/sort.h
506	I think that's manageable. Preprocessing `<climits>` produces 104 LOC while preprocessing `<algorithm>` produces 32888 LOC. That's not even a drop in the ocean. And climits is already included anyways. Using `CHAR_BIT` just makes the code a bit nicer to read. IMO we should actually replace the uses of `__CHAR_BIT__`. If you are worried about compile times you should use `-fmodules`.

Rebase to kick the CI.

Harbormaster completed remote builds in B166076: Diff 431705.May 24 2022, 12:30 PM

hans added inline comments.May 25 2022, 7:08 AM

libcxx/include/__algorithm/sort.h
506	I suppose if <climits> is already included, it doesn't matter in this instance. I'll leave it to someone else to decide whether `CHAR_BIT` or `__CHAR_BIT__` should be preferred in libc++ in general. If you are worried about compile times you should use -fmodules. If only things were that simple :-)

Switch to CHAR_BIT from <climits>.

Please take another look.

Harbormaster completed remote builds in B166262: Diff 431973.May 25 2022, 11:57 AM

nilayvaish accepted this revision.May 25 2022, 9:03 PM

@philnik: Does it look good to you to?

Yes.

This revision is now accepted and ready to land.May 27 2022, 9:03 AM

Closed by commit rG865ad6bd2165: [libc++] Use __libcpp_clz for a tighter __log2i function (authored by hans). · Explain WhyMay 27 2022, 9:58 AM

This revision was automatically updated to reflect the committed changes.

hans added a commit: rG865ad6bd2165: [libc++] Use __libcpp_clz for a tighter __log2i function.

Revision Contents

Path

Size

libcxx/

include/

__algorithm/

sort.h

11 lines

Diff 432594

libcxx/include/__algorithm/sort.h

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
//		//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.		// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.		// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception		// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#ifndef _LIBCPP___ALGORITHM_SORT_H		#ifndef _LIBCPP___ALGORITHM_SORT_H
#define _LIBCPP___ALGORITHM_SORT_H		#define _LIBCPP___ALGORITHM_SORT_H

#include <__algorithm/comp.h>		#include <__algorithm/comp.h>
#include <__algorithm/comp_ref_type.h>		#include <__algorithm/comp_ref_type.h>
#include <__algorithm/min_element.h>		#include <__algorithm/min_element.h>
#include <__algorithm/partial_sort.h>		#include <__algorithm/partial_sort.h>
#include <__algorithm/unwrap_iter.h>		#include <__algorithm/unwrap_iter.h>
		#include <__bits>
#include <__config>		#include <__config>
#include <__utility/swap.h>		#include <__utility/swap.h>
		#include <climits>
#include <memory>		#include <memory>

#if defined(_LIBCPP_DEBUG_RANDOMIZE_UNSPECIFIED_STABILITY)		#if defined(_LIBCPP_DEBUG_RANDOMIZE_UNSPECIFIED_STABILITY)
# include <__algorithm/shuffle.h>		# include <__algorithm/shuffle.h>
#endif		#endif

#if !defined(_LIBCPP_HAS_NO_PRAGMA_SYSTEM_HEADER)		#if !defined(_LIBCPP_HAS_NO_PRAGMA_SYSTEM_HEADER)
# pragma GCC system_header		# pragma GCC system_header
▲ Show 20 Lines • Show All 466 Lines • ▼ Show 20 Lines	if (__i - __first < __last - __i) {
_VSTD::__introsort<_Compare>(__i + difference_type(1), __last, __comp, __depth);		_VSTD::__introsort<_Compare>(__i + difference_type(1), __last, __comp, __depth);
__last = __i;		__last = __i;
}		}
}		}
}		}

template <typename _Number>		template <typename _Number>
inline _LIBCPP_HIDE_FROM_ABI _Number __log2i(_Number __n) {		inline _LIBCPP_HIDE_FROM_ABI _Number __log2i(_Number __n) {
		if (__n == 0)
		return 0;
		if (sizeof(__n) <= sizeof(unsigned))
		return sizeof(unsigned) * CHAR_BIT - 1 - __libcpp_clz(static_cast<unsigned>(__n));
		nilayvaishUnsubmitted Done Reply Inline Actions I am wondering whether CHAR_BIT is preferred over using CHAR_BIT (available from climit header file). I don't know what the convention is for writing library code. nilayvaish: I am wondering whether __CHAR_BIT__ is preferred over using CHAR_BIT (available from climit…
		philnikUnsubmitted Done Reply Inline Actions I think `CHAR_BIT` is the preferred way. At least that's what I'm using. philnik: I think `CHAR_BIT` is the preferred way. At least that's what I'm using.
		hansAuthorUnsubmitted Done Reply Inline Actions It seems both are used roughly as often. $ git grep -In [^_]CHAR_BIT -- libcxx/include \| wc -l 25 $ git grep -In __CHAR_BIT__ -- libcxx/include \| wc -l 22 Given that `__CHAR_BIT__` avoids potentially pushing another header include on every user of `<sort>`, I'd prefer to stay with it. hans: It seems both are used roughly as often. ``` $ git grep -In [^_]CHAR_BIT -- libcxx/include \|…
		philnikUnsubmitted Not Done Reply Inline Actions I think that's manageable. Preprocessing `<climits>` produces 104 LOC while preprocessing `<algorithm>` produces 32888 LOC. That's not even a drop in the ocean. And climits is already included anyways. Using `CHAR_BIT` just makes the code a bit nicer to read. IMO we should actually replace the uses of `__CHAR_BIT__`. If you are worried about compile times you should use `-fmodules`. philnik: I think that's manageable. Preprocessing `<climits>` produces 104 LOC while preprocessing…
		hansAuthorUnsubmitted Done Reply Inline Actions I suppose if <climits> is already included, it doesn't matter in this instance. I'll leave it to someone else to decide whether `CHAR_BIT` or `__CHAR_BIT__` should be preferred in libc++ in general. If you are worried about compile times you should use -fmodules. If only things were that simple :-) hans: I suppose if <climits> is already included, it doesn't matter in this instance. I'll leave it…
		if (sizeof(__n) <= sizeof(unsigned long))
		return sizeof(unsigned long) * CHAR_BIT - 1 - __libcpp_clz(static_cast<unsigned long>(__n));
		if (sizeof(__n) <= sizeof(unsigned long long))
		return sizeof(unsigned long long) * CHAR_BIT - 1 - __libcpp_clz(static_cast<unsigned long long>(__n));

_Number __log2 = 0;		_Number __log2 = 0;
while (__n > 1) {		while (__n > 1) {
__log2++;		__log2++;
__n >>= 1;		__n >>= 1;
}		}
return __log2;		return __log2;
		nilayvaishUnsubmitted Done Reply Inline Actions Instead of retaining this, should we just have a static_assert() on the size of the type being less or equal to that of unsigned long long? nilayvaish: Instead of retaining this, should we just have a static_assert() on the size of the type being…
		philnikUnsubmitted Done Reply Inline Actions We support `__int128`, so this needs to stay. philnik: We support `__int128`, so this needs to stay.
}		}

template <class _Compare, class _RandomAccessIterator>		template <class _Compare, class _RandomAccessIterator>
void __sort(_RandomAccessIterator __first, _RandomAccessIterator __last, _Compare __comp) {		void __sort(_RandomAccessIterator __first, _RandomAccessIterator __last, _Compare __comp) {
typedef typename iterator_traits<_RandomAccessIterator>::difference_type difference_type;		typedef typename iterator_traits<_RandomAccessIterator>::difference_type difference_type;
difference_type __depth_limit = 2 * __log2i(__last - __first);		difference_type __depth_limit = 2 * __log2i(__last - __first);
_VSTD::__introsort<_Compare>(__first, __last, __comp, __depth_limit);		_VSTD::__introsort<_Compare>(__first, __last, __comp, __depth_limit);
}		}
▲ Show 20 Lines • Show All 66 Lines • Show Last 20 Lines