Algorithm for hypotf: compute a*a + b*b in double precision, use Dekker's algorithm to find the rounding error of that sum, and then correct the result after taking its square root.
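As a rough illustration of that summary, here is a minimal sketch (my own, not the committed hypotf.cpp), assuming finite nonzero inputs and round-to-nearest; the branching Fast2Sum variant settled on later in this thread stands in for the error recovery:

```cpp
#include <cmath>

// Sketch only: squares and their sum in double, Fast2Sum to recover the
// sum's rounding error, then a first-order correction after the sqrt.
// Special values and directed rounding modes are ignored here.
float hypotf_sketch(float x, float y) {
  double xd = x, yd = y;
  double x_sq = xd * xd; // exact: squares of 24-bit floats fit in 53 bits
  double y_sq = yd * yd; // exact
  double a = x_sq >= y_sq ? x_sq : y_sq; // Fast2Sum needs |a| >= |b|
  double b = x_sq >= y_sq ? y_sq : x_sq;
  double sum_sq = a + b;
  double err = b - (sum_sq - a); // exact: a + b == sum_sq + err
  double r = std::sqrt(sum_sq);
  // sqrt(sum_sq + err) ~= r + err / (2 * r) to first order.
  return static_cast<float>(r + err / (2.0 * r));
}
```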
Event Timeline
libc/src/math/generic/hypotf.cpp:23
Incorrect variable naming style at many places in this function.

libc/src/math/generic/hypotf.cpp:33
A call to a target-independent builtin for a standard function can lead to a call back into the libc; that is, compilers are free to lower it to a call to the libc's sqrt function. We can refactor our sqrt implementation so that this call to __builtin_sqrt can be replaced with a call to LLVM libc's sqrt.
libc/src/math/generic/hypotf.cpp:33
I refactored our sqrt implementation in https://reviews.llvm.org/D118173.
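Presumably the post-D118173 change looks something like the sketch below; the header path and the fputil::sqrt spelling are my assumptions, not verified against the patch:

```cpp
#include "src/__support/FPUtil/sqrt.h" // assumed header from D118173

// Before: __builtin_sqrt(sum_sq) -- the compiler is free to lower the
// builtin back into a call to the system libc's sqrt.
// After: call LLVM libc's own implementation directly.
static double internal_sqrt(double sum_sq) {
  return __llvm_libc::fputil::sqrt(sum_sq);
}
```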
I get some errors for rounding to nearest:
Difference for x = 0x1.faf49ep+25, y = 0x1.480002p+23:
  llvm_hypot: 0x1.00c5bp+26
  as_hypot:   0x1.00c5b2p+26
  pz_hypot:   0x1.00c5b2p+26
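A hypothetical reproducer for this report (the namespace and linkage of the implementation under test are assumptions on my part):

```cpp
#include <cstdio>

namespace __llvm_libc { float hypotf(float, float); } // under test

int main() {
  float x = 0x1.faf49ep+25f, y = 0x1.480002p+23f;
  // Correctly rounded answer per as_hypot/pz_hypot: 0x1.00c5b2p+26.
  std::printf("%a\n", static_cast<double>(__llvm_libc::hypotf(x, y)));
}
```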
libc/src/math/generic/hypotf.cpp:30–31
I hadn't seen that trick to compute the rounding error; do you have a reference?
libc/src/math/generic/hypotf.cpp:30–31
If I understand correctly, err should get the rounding error of the sum. The algorithm is known as TwoSum. It needs 6 operations, including the sum sumSq, which is the same number of operations as you have. But with 6 add/sub operations, I proved that there is only one algorithm (up to obvious symmetries) that works, and the one above is different, so it will sometimes be incorrect. I think that if xSq and ySq are close to each other and their sum is not exact, then the above algorithm will give you twice the rounding error.
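For reference, the branch-free 6-operation TwoSum the comment refers to (Knuth's algorithm; this is standard material, not code from this patch):

```cpp
// TwoSum: 6 operations including the sum itself; recovers the exact
// rounding error of a + b for any ordering of |a| and |b|.
static double two_sum(double a, double b, double &err) {
  double s = a + b;
  double b_virtual = s - a;
  double a_virtual = s - b_virtual;
  double b_round = b - b_virtual;
  double a_round = a - a_virtual;
  err = a_round + b_round; // exact: a + b == s + err
  return s;
}
```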
Thanks Paul for finding the error! For this example, the sum of squares is actually exact, but the rounding errors still need to be updated in order to avoid double rounding errors. By removing the check (err != 0) at line 36, I get the correct result back.
libc/src/math/generic/hypotf.cpp:30–31
Thanks Paul and Vincent for finding the issue with this! Actually I was trying to implement the Fast2Sum version, since we are in radix 2. Moreover, since both xSq and ySq are non-negative,

  max(xSq, ySq) <= sumSq <= 2 * max(xSq, ySq),

so sumSq - max(xSq, ySq) is exact. I was trying to implement it without a branch, and thought that (sumSq - min(xSq, ySq)) - max(xSq, ySq) = 0, which can easily be disproved by the following example. Consider single precision with xSq = 1 + 2^(-23) (I know it's not a square) and ySq = 2^(-24), with the default rounding mode; then sumSq = xSq + ySq = 1 + 2^(-22). Then:

  sumSq - xSq = 2^(-23)  and  sumSq - ySq = 1 + 2^(-22) = sumSq,

and hence:

  (sumSq - xSq) - ySq = 2^(-24),
  (sumSq - ySq) - xSq = 2^(-23),

so ((sumSq - xSq) - ySq) + ((sumSq - ySq) - xSq) != 2^(-24), which is the actual rounding error. I've changed it back to the normal Fast2Sum implementation with branching, so the rounding error computation should be correct now.
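The branching Fast2Sum the comment ends with would look like this sketch (names are illustrative, not the patch's):

```cpp
// Fast2Sum (Dekker): exact in radix 2 provided |a| >= |b|; the branch
// orders the non-negative squares to satisfy that precondition.
static double fast_two_sum(double x_sq, double y_sq, double &err) {
  double a = x_sq, b = y_sq;
  if (a < b) {
    a = y_sq;
    b = x_sq;
  }
  double sum_sq = a + b;
  err = b - (sum_sq - a); // exact: x_sq + y_sq == sum_sq + err
  return sum_sq;
}
```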
I'm still running semi-exhaustive tests; it takes some time. I wonder whether a full exhaustive test is possible, by comparing the LLVM implementation with the code from Alexei at https://core-math.gitlabpages.inria.fr/. On a 64-core machine (Intel Xeon Gold 6130 @ 2.10GHz), it takes 4.6 s to check 2^33 pairs (x, y). If one tests only positive x, y with x >= y, an exhaustive comparison would have to check 2^61 pairs for each rounding mode, which would take less than 1.5 months using 10000 such machines. This would not be a proof, but the probability that both codes are wrong for the same inputs and give exactly the same wrong answer is quite small.
Or, if you don't mind it being slower, you can compare it with the shift-and-add algorithm implemented in LLVM libc that this patch is trying to speed up, since that one can be proved mathematically correct.
Another option: since the idea of this algorithm is scalable, we could have a version of it for half precision (essentially just changing the data types and masks/constants), where it can be tested exhaustively (a sketch of such a check follows below). That should at least increase the confidence in single precision, and maybe double precision later?
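A sketch of such an exhaustive half-precision check; hypotf16 is an assumed name, _Float16 is a Clang/GCC extension, and std::hypot in double is only a stand-in for a correctly rounded reference:

```cpp
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <cstring>

_Float16 hypotf16(_Float16 x, _Float16 y); // implementation under test

int main() {
  uint64_t mismatches = 0;
  for (uint32_t i = 0; i <= 0xFFFFu; ++i) {
    for (uint32_t j = i; j <= 0xFFFFu; ++j) { // hypot is symmetric in x, y
      uint16_t xi = static_cast<uint16_t>(i), yi = static_cast<uint16_t>(j);
      _Float16 x, y;
      std::memcpy(&x, &xi, sizeof(x));
      std::memcpy(&y, &yi, sizeof(y));
      _Float16 want = static_cast<_Float16>(
          std::hypot(static_cast<double>(x), static_cast<double>(y)));
      _Float16 got = hypotf16(x, y);
      uint16_t wb, gb;
      std::memcpy(&wb, &want, sizeof(wb));
      std::memcpy(&gb, &got, sizeof(gb));
      if (wb != gb) // bitwise compare; NaN payloads would need special-casing
        ++mismatches;
    }
  }
  std::printf("mismatches: %llu\n",
              static_cast<unsigned long long>(mismatches));
  return mismatches != 0;
}
```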
Add a quick return when the exponent difference is at least 2 more than the mantissa length.
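The quick return presumably relies on the fact that, with an exponent gap of at least the mantissa length + 2, the smaller operand moves the result by less than half an ulp. A hedged sketch (my own names and structure; round-to-nearest, finite nonzero inputs):

```cpp
#include <cmath>
#include <utility>

float hypotf_quick_return(float x, float y) {
  if (std::fabs(x) < std::fabs(y))
    std::swap(x, y); // ensure |x| >= |y|
  int ex, ey;
  std::frexp(x, &ex);
  std::frexp(y, &ey);
  // Gap >= 23 + 2: sqrt(x^2 + y^2) - |x| <= y^2 / (2|x|) < ulp(|x|) / 2,
  // so |x| is already the correctly rounded result (to nearest).
  if (ex - ey >= 23 + 2)
    return std::fabs(x);
  double xd = x, yd = y; // otherwise fall through to the full algorithm
  return static_cast<float>(std::sqrt(xd * xd + yd * yd));
}
```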
The version from last Friday is fine for me: I ran exhaustive tests for 2^23 <= y < 2^24 and 2^(23+k) <= x < 2^(24+k) for 0 <= k <= 13.
However, since the patch has changed in the meantime, I no longer have the resources to review the new version.
Thanks Paul for checking! The new version only changes the behavior when the exponent of x is at least the exponent of y plus 25, so your verification should still hold.
Add exhaustive tests for inputs (x, y) with 2^23 <= x < 2^24 and 2^(23 + 14) <= y < 2^(23 + 25).
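These range-exhaustive tests can walk consecutive float values by stepping the raw IEEE-754 encodings; a small illustrative helper (names are mine, not the actual test's):

```cpp
#include <cstdint>
#include <cstring>

// For positive floats, consecutive bit patterns are consecutive values,
// so [lo, hi) can be enumerated by incrementing the encoding of x.
static void check_range(float lo, float hi, float y,
                        void (*check_one)(float x, float y)) {
  uint32_t u, u_hi;
  std::memcpy(&u, &lo, sizeof(u));
  std::memcpy(&u_hi, &hi, sizeof(u_hi));
  for (; u < u_hi; ++u) {
    float x;
    std::memcpy(&x, &u, sizeof(x));
    check_one(x, y);
  }
}
```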
@zimmermann6: I've finished testing the remaining pairs: (x, y) with 2^23 <= y < 2^24, and 2^(23+k) <= x < 2^(24+k) for 14 <= k <= 24.