This is an archive of the discontinued LLVM Phabricator instance.

[libc++] Introsort based sorting function
Abandoned, Public

Authored by philnik on Aug 7 2017, 12:36 PM.


Details

Reviewers

bcraig
rmaprath
compnerd
EricWF
mclow.lists
DIVYA

Summary

The sorting algorithm currently used in libc++ is quicksort with tail recursion elimination, so its worst-case complexity is O(N^2).
This patch reduces the worst-case time complexity by using introsort. Introsort begins with quicksort and switches to heapsort once the recursion depth exceeds a threshold (the depth limit). As a result, the worst-case complexity drops to O(N log N).
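The idea in the summary can be sketched as follows. This is a minimal illustration, not the code from this patch: the name introsort_sketch, the cutoff of 16 elements, the median-of-three pivot and the three-way partition are all assumptions made for the example. The heapsort fallback is written with std::make_heap and std::sort_heap, whereas the patch itself calls __partial_sort over the whole range.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <vector>

// Illustrative sketch only; names and thresholds are hypothetical.
template <class RandomIt>
void introsort_sketch(RandomIt first, RandomIt last, int depth_limit) {
    using T = typename std::iterator_traits<RandomIt>::value_type;
    while (last - first > 16) {
        if (depth_limit == 0) {
            // Partitioning has gone too deep: finish this range with heapsort,
            // which is O(n log n) in the worst case.
            std::make_heap(first, last);
            std::sort_heap(first, last);
            return;
        }
        --depth_limit;

        // Median-of-three pivot, copied so partitioning cannot move it.
        const T& a = *first;
        const T& b = *(first + (last - first) / 2);
        const T& c = *(last - 1);
        const T pivot = std::max(std::min(a, b), std::min(std::max(a, b), c));

        // Three-way split: [first, lt_end) < pivot, [lt_end, eq_end) == pivot.
        RandomIt lt_end = std::partition(first, last,
                                         [&](const T& v) { return v < pivot; });
        RandomIt eq_end = std::partition(lt_end, last,
                                         [&](const T& v) { return !(pivot < v); });

        // Recurse into the smaller side, loop on the larger side.
        if (lt_end - first < last - eq_end) {
            introsort_sketch(first, lt_end, depth_limit);
            first = eq_end;
        } else {
            introsort_sketch(eq_end, last, depth_limit);
            last = lt_end;
        }
    }
    // Small ranges: simple insertion sort.
    for (RandomIt i = first; i != last; ++i)
        std::rotate(std::upper_bound(first, i, *i), i, i + 1);
}

// Entry point: depth limit of 2 * floor(log2(size)).
template <class RandomIt>
void introsort_sketch(RandomIt first, RandomIt last) {
    int depth = 0;
    for (auto n = last - first; n > 1; n >>= 1) ++depth;
    introsort_sketch(first, last, 2 * depth);
    assert(std::is_sorted(first, last));
}
```

Calling introsort_sketch(v.begin(), v.end()) on a std::vector<int> sorts it in place. Passing a depth limit of 0 to the three-argument overload forces the heapsort path straight away, which is a convenient way to exercise the fallback.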

Worked in collaboration with Aditya Kumar.

Diff Detail

Event Timeline

DIVYA created this revision. Aug 7 2017, 12:36 PM

This patch needs benchmarks that demonstrate the performance changes.

Alternatively, you could report a comparison of the old code vs. the new code with an existing benchmark, such as benchmarks/algorithms.bench.cpp.

benchmarks/algorithms.bench.cpp Results

CPU Time With old code (in ns)

BM_sort_std_common<std::vector<int>>/16384 : 730752
BM_sort_std_common<std::vector<int>>/32768 : 1.58E+06
BM_sort_std_ascending<std::vector<int>>/16384 : 17160.5
BM_sort_std_ascending<std::vector<int>>/32768 : 35350.1
BM_sort_std_descending<std::vector<int>>/16384 : 35809
BM_sort_std_descending<std::vector<int>>/32768 : 72133
BM_sort_std_list_with_vector<std::list<int>>/16384 : 124250
BM_sort_std_list_with_vector<std::list<int>>/32768 : 247705
BM_sort_std_worst_quick<std::vector<int>>/16384 : 1.03E+07
BM_sort_std_worst_quick<std::vector<int>>/32768 : 4.04E+07

CPU Time With new code (in ns)

BM_sort_std_common<std::vector<int>>/16384 : 720510
BM_sort_std_common<std::vector<int>>/32768 : 1.55E+06
BM_sort_std_ascending<std::vector<int>>/16384 : 17164.9
BM_sort_std_ascending<std::vector<int>>/32768 : 34726.7
BM_sort_std_descending<std::vector<int>>/16384 : 35671
BM_sort_std_descending<std::vector<int>>/32768 : 72100.7
BM_sort_std_list_with_vector<std::list<int>>/16384 : 125816
BM_sort_std_list_with_vector<std::list<int>>/32768 : 247450
BM_sort_std_worst_quick<std::vector<int>>/16384 : 987016
BM_sort_std_worst_quick<std::vector<int>>/32768 : 2.14E+06

Those are interesting (and useful) results... but they don't look like they came from the same algorithms.bench.cpp that I'm looking at...
https://github.com/llvm-mirror/libcxx/blob/master/benchmarks/algorithms.bench.cpp

That being said, the benchmark there only does 1k elements at a time, and doesn't have the worst case for quick sort like yours does. Adding to the upstream algorithms.bench.cpp seems valuable.

I like this change in general. Dinkumware has been using introsort for 10+ years, so I'm a bit surprised that libc++ wasn't already.

include/algorithm
4208	This comment says basically the same thing as the code. The comment would be more useful if it said why 2*log2(size) is used.

Link to algorithm.bench.cpp benchmark
https://github.com/hiraditya/std-benchmark/blob/master/cxx/algorithm.bench.cpp

include/algorithm
4208	We tested the code with depth limits from log2(size) to 4*log2(size). It gave good performance around 2*log2(size), so the depth limit was fixed at this value.
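For reference, the depth limit under discussion can be computed without floating point. The sketch below is an assumption-laden alternative, not what the patch does: the patch calls _VSTD::log2 from <cmath> and truncates the result, while the hypothetical helpers floor_log2 and depth_limit_for here use integer shifts. A well-balanced quicksort recurses to a depth of roughly log2(size), so a limit of twice that leaves room for moderately unbalanced partitions before the heapsort fallback takes over.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: floor(log2(n)) for n > 0, using shifts only.
inline std::ptrdiff_t floor_log2(std::ptrdiff_t n) {
    assert(n > 0);
    std::ptrdiff_t result = 0;
    while (n >>= 1) ++result;
    return result;
}

// Hypothetical helper: depth limit of 2 * floor(log2(n)); 0 for empty ranges.
inline std::ptrdiff_t depth_limit_for(std::ptrdiff_t n) {
    return n <= 0 ? 0 : 2 * floor_log2(n);
}
```

With the benchmark size used later in this review, depth_limit_for(65536) is 32, so the quicksort phase may descend at most 32 levels on any path before switching to heapsort.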

If you want the performance improvements in the BM_sort_std_worst_quick cases preserved, you really need to port the benchmark from Aditya's repo into the libcxx benchmark code base. Otherwise, the next person that comes along to improve std::sort performance may very well wreck the performance gains you achieved.

Added the benchmark from Aditya's repo to the libcxx benchmark code base.

The test headers should not be in the production include folder. They should probably be in the benchmark folder.

If possible, follow the style and conventions of the existing tests. When you can't follow the style and convention of the existing tests, try to make it obvious (or leave a comment) as to why the new test is special.

For example, one way to follow the existing conventions would be to have a test that looks like this...

BENCHMARK_CAPTURE(BM_Sort, qsort_worst_uint32, getQSortKiller<uint32_t>)->Arg(TestNumInputs);

That test would be called qsort_worst_uint32. You would need to author the sequence generator so that it had a signature like this...
template <class T> std::vector<T> getQSortKiller(size_t N)
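A generator with that signature could look roughly like the sketch below. It follows the adversarial-comparator idea used by make_killer in this revision's diff: sort a vector of indices with a comparator that assigns ("freezes") values lazily, so the sort under attack keeps picking poor pivots. The name makeAdversarialInput and the use of std::size_t indices are assumptions for this example. The generated sequence targets whichever std::sort implementation it is built against.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical generator matching the suggested signature
//   template <class T> std::vector<T> gen(size_t N)
template <class T>
std::vector<T> makeAdversarialInput(std::size_t n) {
    if (n == 0) return {};
    const T gas = static_cast<T>(n - 1);           // "not yet decided" marker
    assert(static_cast<std::size_t>(gas) == n - 1); // T must be able to hold n - 1

    std::vector<T> values(n, gas);
    std::vector<std::size_t> order(n);
    std::iota(order.begin(), order.end(), std::size_t{0});

    T next_solid = 0;           // next concrete value to hand out
    std::size_t candidate = 0;  // index most recently seen as a likely pivot

    std::sort(order.begin(), order.end(), [&](std::size_t x, std::size_t y) {
        // When two undecided items meet, freeze one of them to the smallest
        // unused value, preferring the likely pivot.
        if (values[x] == gas && values[y] == gas) {
            if (x == candidate) values[x] = next_solid++;
            else values[y] = next_solid++;
        }
        if (values[x] == gas) candidate = x;
        else if (values[y] == gas) candidate = y;
        return values[x] < values[y];
    });
    return values;
}
```

Such a function can be passed to BENCHMARK_CAPTURE in the same way as the other generators, for example as makeAdversarialInput<uint32_t>.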

On an encouraging note, I don't see anything wrong with the production code. I'm optimistic about getting this in once we iron out the test issues.

Results with the patch.

Before:

Run on (8 X 3900 MHz CPU s)
2017-08-20 15:11:41
-------------------------------------------------------------------------------
Benchmark                                        Time           CPU Iterations
-------------------------------------------------------------------------------
BM_Sort/random_uint32/65536               14202353 ns   14203202 ns         48
BM_Sort/sorted_ascending_uint32/65536       254100 ns     254108 ns       2754
BM_Sort/sorted_descending_uint32/65536      552118 ns     552151 ns       1232
BM_Sort/single_element_uint32/65536         170140 ns     170136 ns       4090
BM_Sort/pipe_organ_uint32/65536            5989117 ns    5989494 ns        113
BM_Sort/random_strings/65536             105697682 ns  105702553 ns          7
BM_Sort/sorted_ascending_strings/65536    13324109 ns   13324186 ns         50
BM_Sort/sorted_descending_strings/65536   19057303 ns   19058005 ns         36
BM_Sort/single_element_strings/65536      57941433 ns   57944691 ns         12
BM_Sort/qsort_worst_uint32/65536         694858550 ns  694894213 ns          1

After:

Run on (8 X 3900 MHz CPU s)
2017-08-20 15:15:14
-------------------------------------------------------------------------------
Benchmark                                        Time           CPU Iterations
-------------------------------------------------------------------------------
BM_Sort/random_uint32/65536               14073209 ns   14073732 ns         49
BM_Sort/sorted_ascending_uint32/65536       257596 ns     257610 ns       2740
BM_Sort/sorted_descending_uint32/65536      560208 ns     560069 ns       1226
BM_Sort/single_element_uint32/65536         170543 ns     170549 ns       4075
BM_Sort/pipe_organ_uint32/65536            6008832 ns    6009173 ns        113
BM_Sort/random_strings/65536             104672888 ns  104677220 ns          7
BM_Sort/sorted_ascending_strings/65536    13334016 ns   13334393 ns         54
BM_Sort/sorted_descending_strings/65536   18883275 ns   18883831 ns         37
BM_Sort/single_element_strings/65536      57022905 ns   57025206 ns         12
BM_Sort/qsort_worst_uint32/65536          16870788 ns   16871828 ns         41

The last test, which exploits quicksort's worst-case behavior, improves from about 695 ms to about 17 ms (roughly 41x), while the others are mostly unaffected.

Added the test qsort_worst_uint32 to algorithms.bench.cpp.

LGTM. You should probably get one other person to approve though. I'm hoping that @EricWF or @mclow.lists will take a look

Ping!

ping

hiraditya mentioned this in D113413: Add introsort to avoid O(n^2) behavior and a benchmark for adversarial quick sort input. May 31 2022, 5:12 PM

This has been superseded by D113413.

Herald added a project: Restricted Project. Jun 1 2022, 12:42 AM

Herald added a subscriber: mgrang.

philnik abandoned this revision. Jun 1 2022, 12:42 AM

Revision Contents

Path                              Size
benchmarks/GenerateInput.hpp      35 lines
benchmarks/algorithms.bench.cpp   5 lines
include/algorithm                 38 lines

Diff 111998

benchmarks/GenerateInput.hpp

	[132 lines not shown]
	inline std::vector<const char*> getRandomCStringInputs(size_t N) {			inline std::vector<const char*> getRandomCStringInputs(size_t N) {
	static std::vector<std::string> inputs = getRandomStringInputs(N);			static std::vector<std::string> inputs = getRandomStringInputs(N);
	std::vector<const char*> cinputs;			std::vector<const char*> cinputs;
	for (auto const& str : inputs)			for (auto const& str : inputs)
	cinputs.push_back(str.c_str());			cinputs.push_back(str.c_str());
	return cinputs;			return cinputs;
	}			}

				template <class T>
				inline std::vector<T> make_killer(size_t N) {
				std::vector<T> inputs;
				uint32_t candidate = 0;
				uint32_t num_solid = 0;
				uint32_t gas = N - 1;

				std::vector<T> tmp(N);
				inputs.resize(N);

				for (T i = 0; i < N; ++i) {
				tmp[i] = i;
				inputs[i] = gas;
				}

				std::sort(tmp.begin(), tmp.end(), [&](T x, T y) {
				if (inputs[x] == gas && inputs[y] == gas) {
				if (x == candidate) inputs[x] = num_solid++;
				else inputs[y] = num_solid++;
				}

				if (inputs[x] == gas) candidate = x;
				else if (inputs[y] == gas) candidate = y;

				return inputs[x] < inputs[y];
				});
				return inputs;
				}


				template <class T>
				inline std::vector<T> getQSortKiller(size_t N){
				std::vector<T> inputs = make_killer<T>(N);
				return inputs;
				}

	#endif // BENCHMARK_GENERATE_INPUT_HPP			#endif // BENCHMARK_GENERATE_INPUT_HPP

benchmarks/algorithms.bench.cpp

#include <unordered_set>		#include <unordered_set>
#include <vector>		#include <vector>
#include <cstdint>		#include <cstdint>

#include "benchmark/benchmark_api.h"		#include "benchmark/benchmark_api.h"
#include "GenerateInput.hpp"		#include "GenerateInput.hpp"

constexpr std::size_t TestNumInputs = 1024;		constexpr std::size_t TestNumInputs = 1024*64;

template <class GenInputs>		template <class GenInputs>
void BM_Sort(benchmark::State& st, GenInputs gen) {		void BM_Sort(benchmark::State& st, GenInputs gen) {
using ValueType = typename decltype(gen(0))::value_type;		using ValueType = typename decltype(gen(0))::value_type;
const auto in = gen(st.range(0));		const auto in = gen(st.range(0));
std::vector<ValueType> inputs[5];		std::vector<ValueType> inputs[5];
auto reset_inputs = [&]() {		auto reset_inputs = [&]() {
for (auto& C : inputs) {		for (auto& C : inputs) {
[36 lines not shown]	BENCHMARK_CAPTURE(BM_Sort, sorted_ascending_strings,
getSortedStringInputs)->Arg(TestNumInputs);		getSortedStringInputs)->Arg(TestNumInputs);

BENCHMARK_CAPTURE(BM_Sort, sorted_descending_strings,		BENCHMARK_CAPTURE(BM_Sort, sorted_descending_strings,
getReverseSortedStringInputs)->Arg(TestNumInputs);		getReverseSortedStringInputs)->Arg(TestNumInputs);

BENCHMARK_CAPTURE(BM_Sort, single_element_strings,		BENCHMARK_CAPTURE(BM_Sort, single_element_strings,
getDuplicateStringInputs)->Arg(TestNumInputs);		getDuplicateStringInputs)->Arg(TestNumInputs);

		BENCHMARK_CAPTURE(BM_Sort, qsort_worst_uint32,
		getQSortKiller<uint32_t>)->Arg(TestNumInputs);


BENCHMARK_MAIN()		BENCHMARK_MAIN()

include/algorithm

[636 lines not shown]

#include <__config>		#include <__config>
#include <initializer_list>		#include <initializer_list>
#include <type_traits>		#include <type_traits>
#include <cstring>		#include <cstring>
#include <utility> // needed to provide swap_ranges.		#include <utility> // needed to provide swap_ranges.
#include <memory>		#include <memory>
#include <iterator>		#include <iterator>
		#include <cmath>
#include <cstddef>		#include <cstddef>

#if defined(__IBMCPP__)		#if defined(__IBMCPP__)
#include "support/ibm/support.h"		#include "support/ibm/support.h"
#endif		#endif
#if defined(_LIBCPP_COMPILER_MSVC)		#if defined(_LIBCPP_COMPILER_MSVC)
#include <intrin.h>		#include <intrin.h>
#endif		#endif
[3,336 lines not shown]	if (__first1 != __last1)
}		}
}		}
__h.release();		__h.release();
}		}
}		}

template <class _Compare, class _RandomAccessIterator>		template <class _Compare, class _RandomAccessIterator>
void		void
__sort(_RandomAccessIterator __first, _RandomAccessIterator __last, _Compare __comp)		__partial_sort(_RandomAccessIterator, _RandomAccessIterator, _RandomAccessIterator,
		_Compare);

		// Using introsort algorithm for sorting
		template <class _Compare, class _RandomAccessIterator>
		void
		__intro_sort(_RandomAccessIterator __first, _RandomAccessIterator __last, _Compare __comp,
		typename iterator_traits<_RandomAccessIterator>::difference_type __depth_limit)
{		{
// _Compare is known to be a reference type		// _Compare is known to be a reference type
typedef typename iterator_traits<_RandomAccessIterator>::difference_type difference_type;		typedef typename iterator_traits<_RandomAccessIterator>::difference_type difference_type;
typedef typename iterator_traits<_RandomAccessIterator>::value_type value_type;		typedef typename iterator_traits<_RandomAccessIterator>::value_type value_type;
const difference_type __limit = is_trivially_copy_constructible<value_type>::value &&		const difference_type __limit = is_trivially_copy_constructible<value_type>::value &&
is_trivially_copy_assignable<value_type>::value ? 30 : 6;		is_trivially_copy_assignable<value_type>::value ? 30 : 6;
while (true)		while (true)
{		{
[18 lines not shown]	__restart:
_VSTD::__sort5<_Compare>(__first, __first+1, __first+2, __first+3, --__last, __comp);		_VSTD::__sort5<_Compare>(__first, __first+1, __first+2, __first+3, --__last, __comp);
return;		return;
}		}
if (__len <= __limit)		if (__len <= __limit)
{		{
_VSTD::__insertion_sort_3<_Compare>(__first, __last, __comp);		_VSTD::__insertion_sort_3<_Compare>(__first, __last, __comp);
return;		return;
}		}
		if (__depth_limit == 0)
		{
		__partial_sort<_Compare>(__first,__last,__last,__comp);
		return;
		}

// __len > 5		// __len > 5
_RandomAccessIterator __m = __first;		_RandomAccessIterator __m = __first;
_RandomAccessIterator __lm1 = __last;		_RandomAccessIterator __lm1 = __last;
--__lm1;		--__lm1;
unsigned __n_swaps;		unsigned __n_swaps;
{		{
difference_type __delta;		difference_type __delta;
if (__len >= 1000)		if (__len >= 1000)
[127 lines not shown]	__restart:
__first = ++__i;		__first = ++__i;
continue;		continue;
}		}
}		}
}		}
// sort smaller range with recursive call and larger with tail recursion elimination		// sort smaller range with recursive call and larger with tail recursion elimination
if (__i - __first < __last - __i)		if (__i - __first < __last - __i)
{		{
_VSTD::__sort<_Compare>(__first, __i, __comp);		_VSTD::__intro_sort<_Compare>(__first, __i, __comp, __depth_limit);
// _VSTD::__sort<_Compare>(__i+1, __last, __comp);		// _VSTD::__intro_sort<_Compare>(__i+1, __last, __comp, __depth_limit);
__first = ++__i;		__first = ++__i;
}		}
else		else
{		{
_VSTD::__sort<_Compare>(__i+1, __last, __comp);		_VSTD::__intro_sort<_Compare>(__i+1, __last, __comp, __depth_limit);
// _VSTD::__sort<_Compare>(__first, __i, __comp);		// _VSTD::__intro_sort<_Compare>(__first, __i, __comp, __depth_limit);
__last = __i;		__last = __i;
}		}
		--__depth_limit;
}		}
}		}

		template <class _Compare, class _RandomAccessIterator>
		void
		__sort(_RandomAccessIterator __first, _RandomAccessIterator __last, _Compare __comp)
		{

		// Threshold(or depth limit) for introsort is taken to be 2*log2(size)
		typedef typename iterator_traits<_RandomAccessIterator>::difference_type difference_type;
		const difference_type __dp = __last - __first;
		difference_type __depth_limit = __last == __first ? 0 : _VSTD::log2(__dp);
		__depth_limit *= 2;
		__intro_sort<_Compare>(__first, __last, __comp, __depth_limit);
		}

// This forwarder keeps the top call and the recursive calls using the same instantiation, forcing a reference _Compare		// This forwarder keeps the top call and the recursive calls using the same instantiation, forcing a reference _Compare
template <class _RandomAccessIterator, class _Compare>		template <class _RandomAccessIterator, class _Compare>
inline _LIBCPP_INLINE_VISIBILITY		inline _LIBCPP_INLINE_VISIBILITY
void		void
sort(_RandomAccessIterator __first, _RandomAccessIterator __last, _Compare __comp)		sort(_RandomAccessIterator __first, _RandomAccessIterator __last, _Compare __comp)
{		{
#ifdef _LIBCPP_DEBUG		#ifdef _LIBCPP_DEBUG
typedef typename add_lvalue_reference<__debug_less<_Compare> >::type _Comp_ref;		typedef typename add_lvalue_reference<__debug_less<_Compare> >::type _Comp_ref;
[1,709 lines not shown]