This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
compiler-rt/lib/builtins/
-
lib/
-
builtins/
-
udivmodti4.c

Differential D83547

[builtins] Optimize the 128-bit division according to "Improved division by invariant integers" paper
AbandonedPublic

Authored by danlark on Jul 10 2020, 5:34 AM.

Download Raw Diff

Details

Reviewers

courbet
MaskRay
howard.hinnant

Summary

Optimize the 128-bit division according to https://gmplib.org/~tege/division-paper.pdf

Old benchmark:

-------------------------------------------------------------------------------------------
Benchmark                                                    Time           CPU Iterations
-------------------------------------------------------------------------------------------
BM_DivideIntrinsic128UniformDivisor<__uint128_t>            12 ns         12 ns   57176570
BM_DivideIntrinsic128UniformDivisor<__int128_t>             14 ns         14 ns   49819848
BM_RemainderIntrinsic128UniformDivisor<__uint128_t>         13 ns         13 ns   54875157
BM_RemainderIntrinsic128UniformDivisor<__int128_t>          14 ns         14 ns   51453585
BM_DivideIntrinsic128SmallDivisor<__uint128_t>              27 ns         27 ns   25857150
BM_DivideIntrinsic128SmallDivisor<__int128_t>               29 ns         29 ns   23769665
BM_RemainderIntrinsic128SmallDivisor<__uint128_t>           28 ns         28 ns   25264879
BM_RemainderIntrinsic128SmallDivisor<__int128_t>            30 ns         30 ns   23714126

New benchmark

-------------------------------------------------------------------------------------------
Benchmark                                                    Time           CPU Iterations
-------------------------------------------------------------------------------------------
BM_DivideIntrinsic128UniformDivisor<__uint128_t>            12 ns         12 ns   56122341
BM_DivideIntrinsic128UniformDivisor<__int128_t>             14 ns         14 ns   50306881
BM_RemainderIntrinsic128UniformDivisor<__uint128_t>         13 ns         13 ns   56006931
BM_RemainderIntrinsic128UniformDivisor<__int128_t>          13 ns         13 ns   52443568
BM_DivideIntrinsic128SmallDivisor<__uint128_t>              13 ns         13 ns   53636689
BM_DivideIntrinsic128SmallDivisor<__int128_t>               15 ns         15 ns   45276581
BM_RemainderIntrinsic128SmallDivisor<__uint128_t>           15 ns         15 ns   46067118
BM_RemainderIntrinsic128SmallDivisor<__int128_t>            17 ns         17 ns   40513535

PowerPC and ARM benchmarks are the same or slightly better.

I haven't proceeded it with the previous patch because this one contains 512 byte table. If you think this is not acceptable in builtins library, I'll just revert it then.

Diff Detail

Repository: rG LLVM Github Monorepo

Unit TestsFailed

	Time	Test
	420 ms	linux > MemorySanitizer-X86_64.MemorySanitizer-X86_64::Unknown Unit Message ("")
	480 ms	linux > MemorySanitizer-X86_64.MemorySanitizer-X86_64::Unknown Unit Message ("")
	530 ms	linux > SanitizerCommon-asan-x86_64-Linux.Linux::Unknown Unit Message ("")
	320 ms	linux > SanitizerCommon-asan-x86_64-Linux.Posix::Unknown Unit Message ("")
	300 ms	linux > SanitizerCommon-lsan-x86_64-Linux.Linux::Unknown Unit Message ("")
		View Full Test Results (13 Failed)

Event Timeline

danlark created this revision.Jul 10 2020, 5:34 AM

Herald added a project: Restricted Project. · View Herald TranscriptJul 10 2020, 5:34 AM

Herald added subscribers: Restricted Project, steven.zhang, kristof.beyls. · View Herald Transcript

danlark retitled this revision from Optimize the 128-bit division according to https://gmplib.org/~tege/division-paper.pdf to [builtins] Optimize the 128-bit division according to "Improved division by invariant integers" paper.Jul 10 2020, 5:35 AM

danlark edited the summary of this revision. (Show Details)

I have no idea what the policy is WRT size for compiler-rt. How big is the diff if you take code size into consideration ?

In D83547#2143840, @courbet wrote:

I have no idea what the policy is WRT size for compiler-rt. How big is the diff if you take code size into consideration ?

du -shb ./build/lib/clang/11.0.0/lib/linux/libclang_rt.builtins-x86_64_old.a
223538	./build/lib/clang/11.0.0/lib/linux/libclang_rt.builtins-x86_64_old.a

du -shb ./build/lib/clang/11.0.0/lib/linux/libclang_rt.builtins-x86_64.a
224658	./build/lib/clang/11.0.0/lib/linux/libclang_rt.builtins-x86_64.a

+1.1kb

Provide better names than in paper

danlark added a reviewer: howard.hinnant.Jul 10 2020, 5:57 AM

Harbormaster failed remote builds in B63732: Diff 277001!Jul 10 2020, 6:04 AM

Harbormaster failed remote builds in B63738: Diff 277008!Jul 10 2020, 6:37 AM

I need to double check the cold cache performance, it may happen that it is slower in that case

danlark abandoned this revision.May 15 2021, 2:48 AM

Revision Contents

Path

Size

compiler-rt/

lib/

builtins/

udivmodti4.c

152 lines

Diff 277008

compiler-rt/lib/builtins/udivmodti4.c

	//===-- udivmodti4.c - Implement __udivmodti4 -----------------------------===//			//===-- udivmodti4.c - Implement __udivmodti4 -----------------------------===//
	//			//
	// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.			// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
	// See https://llvm.org/LICENSE.txt for license information.			// See https://llvm.org/LICENSE.txt for license information.
	// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception			// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	//			//
	// This file implements __udivmodti4 for the compiler_rt library.			// This file implements __udivmodti4 for the compiler_rt library.
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	#include "int_lib.h"			#include "int_lib.h"

	#ifdef CRT_HAS_128BIT			#ifdef CRT_HAS_128BIT

	// Returns the 128 bit division result by 64 bit. Result must fit in 64 bits.			#define RECIPROCAL_TABLE_ITEM(d) \
	// Remainder stored in r.			(unsigned short)(0x7fd00 / (0x100 \| (unsigned char)d))
	// Taken and adjusted from libdivide libdivide_128_div_64_to_64 division
	// fallback. For a correctness proof see the reference for this algorithm			#define REPEAT4(x) \
	// in Knuth, Volume 2, section 4.3.1, Algorithm D.			RECIPROCAL_TABLE_ITEM((x) + 0), RECIPROCAL_TABLE_ITEM((x) + 1), \
	UNUSED			RECIPROCAL_TABLE_ITEM((x) + 2), RECIPROCAL_TABLE_ITEM((x) + 3)
	static inline du_int udiv128by64to64default(du_int u1, du_int u0, du_int v,
	du_int *r) {			#define REPEAT32(x) \
	const unsigned n_udword_bits = sizeof(du_int) * CHAR_BIT;			REPEAT4((x) + 4 * 0), REPEAT4((x) + 4 * 1), REPEAT4((x) + 4 * 2), \
	const du_int b = (1ULL << (n_udword_bits / 2)); // Number base (32 bits)			REPEAT4((x) + 4 * 3), REPEAT4((x) + 4 * 4), REPEAT4((x) + 4 * 5), \
	du_int un1, un0; // Norm. dividend LSD's			REPEAT4((x) + 4 * 6), REPEAT4((x) + 4 * 7)
	du_int vn1, vn0; // Norm. divisor digits
	du_int q1, q0; // Quotient digits			#define REPEAT256() \
	du_int un64, un21, un10; // Dividend digit pairs			REPEAT32(32 * 0), REPEAT32(32 * 1), REPEAT32(32 * 2), REPEAT32(32 * 3), \
	du_int rhat; // A remainder			REPEAT32(32 * 4), REPEAT32(32 * 5), REPEAT32(32 * 6), REPEAT32(32 * 7)
	si_int s; // Shift amount for normalization
				// Reciprocal lookup table of 512 bytes.
	s = __builtin_clzll(v);			static unsigned short kReciprocalTable[] = {REPEAT256()};
				Lint: Pre-merge checks Inline Actions clang-tidy: warning: invalid case style for variable 'kReciprocalTable' [readability-identifier-naming] not useful Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'kReciprocalTable' [readability-identifier…
	if (s > 0) {
	// Normalize the divisor.			// Computes the reciprocal (2^128 - 1) / d - 2^64 for normalized d.
	v = v << s;			// Based on Algorithm 2 from "Improved division by invariant integers".
	un64 = (u1 << s) \| (u0 >> (n_udword_bits - s));			static inline du_int reciprocal2by1(du_int d) {
				Lint: Pre-merge checks Inline Actions clang-tidy: warning: invalid case style for parameter 'd' [readability-identifier-naming] not useful Lint: Pre-merge checks: clang-tidy: warning: invalid case style for parameter 'd' [readability-identifier-naming]…
	un10 = u0 << s; // Shift dividend left
	} else {			const du_int d9 = d >> 55;
				Lint: Pre-merge checks Inline Actions clang-tidy: warning: invalid case style for variable 'd9' [readability-identifier-naming] not useful Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'd9' [readability-identifier-naming]…
	// Avoid undefined behavior of (u0 >> 64).			const su_int v0 = kReciprocalTable[d9 - 256];
				Lint: Pre-merge checks Inline Actions clang-tidy: warning: invalid case style for variable 'v0' [readability-identifier-naming] not useful Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'v0' [readability-identifier-naming]…
	un64 = u1;
	un10 = u0;			const du_int d40 = (d >> 24) + 1;
				Lint: Pre-merge checks Inline Actions clang-tidy: warning: invalid case style for variable 'd40' [readability-identifier-naming] not useful Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'd40' [readability-identifier-naming]…
	}			const du_int v1 = (v0 << 11) - (su_int)(v0 * v0 * d40 >> 40) - 1;
				Lint: Pre-merge checks Inline Actions clang-tidy: warning: invalid case style for variable 'v1' [readability-identifier-naming] not useful Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'v1' [readability-identifier-naming]…

	// Break divisor up into two 32-bit digits.			const du_int v2 = (v1 << 13) + (v1 * (0x1000000000000000 - v1 * d40) >> 47);
				Lint: Pre-merge checks Inline Actions clang-tidy: warning: invalid case style for variable 'v2' [readability-identifier-naming] not useful Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'v2' [readability-identifier-naming]…
	vn1 = v >> (n_udword_bits / 2);
	vn0 = v & 0xFFFFFFFF;			const du_int d0 = d & 1;
				Lint: Pre-merge checks Inline Actions clang-tidy: warning: invalid case style for variable 'd0' [readability-identifier-naming] not useful Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'd0' [readability-identifier-naming]…
				const du_int d63 = (d >> 1) + d0;
				Lint: Pre-merge checks Inline Actions clang-tidy: warning: invalid case style for variable 'd63' [readability-identifier-naming] not useful Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'd63' [readability-identifier-naming]…
	// Break right half of dividend into two digits.			const du_int e = ((v2 >> 1) & (0 - d0)) - v2 * d63;
				Lint: Pre-merge checks Inline Actions clang-tidy: warning: invalid case style for variable 'e' [readability-identifier-naming] not useful Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'e' [readability-identifier-naming]…
	un1 = un10 >> (n_udword_bits / 2);			const du_int v3 = (((tu_int)(v2)*e) >> 65) + (v2 << 31);
				Lint: Pre-merge checks Inline Actions clang-tidy: warning: invalid case style for variable 'v3' [readability-identifier-naming] not useful Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'v3' [readability-identifier-naming]…
	un0 = un10 & 0xFFFFFFFF;
				const du_int v4 = v3 - ((((tu_int)(v3)*d) + d) >> 64) - d;
				Lint: Pre-merge checks Inline Actions clang-tidy: warning: invalid case style for variable 'v4' [readability-identifier-naming] not useful Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'v4' [readability-identifier-naming]…
	// Compute the first quotient digit, q1.			return v4;
	q1 = un64 / vn1;			}
	rhat = un64 - q1 * vn1;
				// Based on Algorithm 2 from "Improved division by invariant integers".
	// q1 has at most error 2. No more than 2 iterations.			static inline du_int udivrem2by1(utwords dividend, du_int divisor,
				Lint: Pre-merge checks Inline Actions clang-tidy: warning: invalid case style for parameter 'dividend' [readability-identifier-naming] not useful clang-tidy: warning: invalid case style for parameter 'divisor' [readability-identifier-naming] not useful Lint: Pre-merge checks: clang-tidy: warning: invalid case style for parameter 'dividend' [readability-identifier…
	while (q1 >= b \|\| q1 * vn0 > b * rhat + un1) {			du_int reciprocal, du_int *remainder) {
				Lint: Pre-merge checks Inline Actions clang-tidy: warning: invalid case style for parameter 'reciprocal' [readability-identifier-naming] not useful clang-tidy: warning: invalid case style for parameter 'remainder' [readability-identifier-naming] not useful Lint: Pre-merge checks: clang-tidy: warning: invalid case style for parameter 'reciprocal' [readability-identifier…
	q1 = q1 - 1;			utwords quotient;
				Lint: Pre-merge checks Inline Actions clang-tidy: warning: invalid case style for variable 'quotient' [readability-identifier-naming] not useful Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'quotient' [readability-identifier-naming]…
	rhat = rhat + vn1;			quotient.all = (tu_int)(reciprocal)*dividend.s.high;
	if (rhat >= b)			quotient.all += dividend.all;
	break;
	}			++quotient.s.high;

	un21 = un64 * b + un1 - q1 * v;			remainder = dividend.s.low - quotient.s.high divisor;

	// Compute the second quotient digit.			if (*remainder > quotient.s.low) {
	q0 = un21 / vn1;			--quotient.s.high;
	rhat = un21 - q0 * vn1;			*remainder += divisor;
				}
	// q0 has at most error 2. No more than 2 iterations.
	while (q0 >= b \|\| q0 * vn0 > b * rhat + un0) {			if (*remainder >= divisor) {
	q0 = q0 - 1;			++quotient.s.high;
	rhat = rhat + vn1;			*remainder -= divisor;
	if (rhat >= b)			}
	break;
	}			return quotient.s.high;

	r = (un21 b + un0 - q0 * v) >> s;
	return q1 * b + q0;
	}

	static inline du_int udiv128by64to64(du_int u1, du_int u0, du_int v,
	du_int *r) {
	#if defined(__x86_64__)
	du_int result;
	__asm__("divq %[v]"
	: "=a"(result), "=d"(*r)
	: [ v ] "r"(v), "a"(u0), "d"(u1));
	return result;
	#else
	return udiv128by64to64default(u1, u0, v, r);
	#endif
	}			}

	// Effects: if rem != 0, *rem = a % b			// Effects: if rem != 0, *rem = a % b
	// Returns: a / b			// Returns: a / b

	COMPILER_RT_ABI tu_int __udivmodti4(tu_int a, tu_int b, tu_int *rem) {			COMPILER_RT_ABI tu_int __udivmodti4(tu_int a, tu_int b, tu_int *rem) {
	const unsigned n_utword_bits = sizeof(tu_int) * CHAR_BIT;			const unsigned n_utword_bits = sizeof(tu_int) * CHAR_BIT;
	utwords dividend;			utwords dividend;
	dividend.all = a;			dividend.all = a;
	utwords divisor;			utwords divisor;
	divisor.all = b;			divisor.all = b;
	utwords quotient;			utwords quotient;
	utwords remainder;			utwords remainder;
	if (divisor.all > dividend.all) {			if (divisor.all > dividend.all) {
	if (rem)			if (rem)
	*rem = dividend.all;			*rem = dividend.all;
	return 0;			return 0;
	}			}
	// When the divisor fits in 64 bits, we can use an optimized path.			// When the divisor fits in 64 bits, we can use an optimized path.
	if (divisor.s.high == 0) {			if (divisor.s.high == 0) {
				const du_int left_shift = __builtin_clzll(divisor.s.low);
				Lint: Pre-merge checks Inline Actions clang-tidy: warning: invalid case style for variable 'left_shift' [readability-identifier-naming] not useful Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'left_shift' [readability-identifier…
				const du_int right_shift = (64 - left_shift) % 64;
				Lint: Pre-merge checks Inline Actions clang-tidy: warning: invalid case style for variable 'right_shift' [readability-identifier-naming] not useful Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'right_shift' [readability-identifier…
				const du_int right_shift_mask = (du_int)(left_shift == 0) - 1;
				Lint: Pre-merge checks Inline Actions clang-tidy: warning: invalid case style for variable 'right_shift_mask' [readability-identifier-naming] not useful Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'right_shift_mask' [readability-identifier…
				utwords normalized_quotient;
				Lint: Pre-merge checks Inline Actions clang-tidy: warning: invalid case style for variable 'normalized_quotient' [readability-identifier-naming] not useful Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'normalized_quotient' [readability…
				normalized_quotient.s.low =
				(dividend.s.high << left_shift) \|
				((dividend.s.low >> right_shift) & right_shift_mask);
				normalized_quotient.s.high =
				(dividend.s.high >> right_shift) & right_shift_mask;
				const du_int normalized_divisor = divisor.s.low << left_shift;
				Lint: Pre-merge checks Inline Actions clang-tidy: warning: invalid case style for variable 'normalized_divisor' [readability-identifier-naming] not useful Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'normalized_divisor' [readability…
				const du_int reciprocal = reciprocal2by1(normalized_divisor);
				Lint: Pre-merge checks Inline Actions clang-tidy: warning: invalid case style for variable 'reciprocal' [readability-identifier-naming] not useful Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'reciprocal' [readability-identifier…
				du_int normalized_remainder;
				Lint: Pre-merge checks Inline Actions clang-tidy: warning: invalid case style for variable 'normalized_remainder' [readability-identifier-naming] not useful Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'normalized_remainder' [readability…
				quotient.s.high = udivrem2by1(normalized_quotient, normalized_divisor,
				reciprocal, &normalized_remainder);
				normalized_quotient.s.high = normalized_remainder;
				normalized_quotient.s.low = dividend.s.low << left_shift;
	remainder.s.high = 0;			remainder.s.high = 0;
	if (dividend.s.high < divisor.s.low) {			quotient.s.low = udivrem2by1(normalized_quotient, normalized_divisor,
	// The result fits in 64 bits.			reciprocal, &remainder.s.low);
	quotient.s.low = udiv128by64to64(dividend.s.high, dividend.s.low,			remainder.s.low >>= left_shift;
	divisor.s.low, &remainder.s.low);
	quotient.s.high = 0;
	} else {
	// First, divide with the high part to get the remainder in dividend.s.high.
	// After that dividend.s.high < divisor.s.low.
	quotient.s.high = dividend.s.high / divisor.s.low;
	dividend.s.high = dividend.s.high % divisor.s.low;
	quotient.s.low = udiv128by64to64(dividend.s.high, dividend.s.low,
	divisor.s.low, &remainder.s.low);
	}
	if (rem)			if (rem)
	*rem = remainder.all;			*rem = remainder.all;
	return quotient.all;			return quotient.all;
	}			}
	// 0 <= shift <= 63.			// 0 <= shift <= 63.
	si_int shift =			si_int shift =
	__builtin_clzll(divisor.s.high) - __builtin_clzll(dividend.s.high);			__builtin_clzll(divisor.s.high) - __builtin_clzll(dividend.s.high);
	divisor.all <<= shift;			divisor.all <<= shift;
	Show All 22 Lines