
Details

Reviewers

sivachandra
michaelrj

Commits

rGcdf6a581b927: [libc] Use intrinsics for x86-64 fma and optimize PolyEval for x86-64 with…

Summary

Use intrinsics for x86-64 fma
Optimize PolyEval for x86-64 with degree 3 & 5 polynomials.
There might be a slight loss of accuracy compared to Horner's scheme because the optimized versions use the higher powers x^2 and x^3 in the computations.
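To make the accuracy remark concrete, here is a minimal sketch of the two evaluation orders for a cubic, written with `std::fma` instead of the libc-internal `fma`. The names `horner3` and `split3` are hypothetical and do not appear in the patch; the split form mirrors the expression in the patch's comments, `(a3x + a2) x^2 + (a1*x + a0)`.

```cpp
#include <cmath>

// Horner's scheme: a0 + x*(a1 + x*(a2 + x*a3)).
// Three FMAs, each depending on the previous one.
inline double horner3(double x, double a0, double a1, double a2, double a3) {
  return std::fma(x, std::fma(x, std::fma(x, a3, a2), a1), a0);
}

// Split scheme: (a3*x + a2) * x^2 + (a1*x + a0).
// The two inner FMAs and x*x are independent of each other, but x*x is
// rounded before it is used, which is one extra rounding step that
// Horner's scheme does not have.
inline double split3(double x, double a0, double a1, double a2, double a3) {
  double hi = std::fma(a3, x, a2);
  double lo = std::fma(a1, x, a0);
  double x2 = x * x;
  return std::fma(hi, x2, lo);
}
```

For inputs where every intermediate value is exactly representable the two functions agree; in general they may differ in the last bits.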

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

lntue created this revision. Dec 8 2021, 7:31 AM

Herald added a project: Restricted Project. Dec 8 2021, 7:31 AM

Herald added subscribers: ecnelises, tschuett, pengfei, mgorny.

lntue requested review of this revision. Dec 8 2021, 7:31 AM

Harbormaster completed remote builds in B138159: Diff 392768. Dec 8 2021, 7:42 AM

The subject line and description are confusing. It seems to me like you have added specializations for 3rd and 5th degree polynomials. But, the description has "3-6 polynomials".

libc/src/__support/FPUtil/PolyEval.h
36	Couple of things:
1. We want individual header files to be self-contained. So, this include should be moved to the header files where it is required.
2. Because of the same self-contained headers requirement, the entities defined in a header file should be in the `__llvm_libc` namespace explicitly, as opposed to ending up in a namespace when included like this here. So, these includes should be moved outside of the namespace here.
libc/src/__support/FPUtil/x86_64/PolyEvalDouble.h
1 ↗	(On Diff #392768)	Fix the line length. I am also not sure the description is correct.
libc/src/__support/FPUtil/x86_64/PolyEvalFloat.h
1 ↗	(On Diff #392768)	Fix the line length. I am also not sure the description is correct.
libc/src/math/generic/CMakeLists.txt
503 ↗	(On Diff #392768)	Can you add a comment explaining what in the implementation would be affected by this? AFAICT, we are either using the fma instructions directly, or are calling the fma related builtins. So, it is not clear to me as to why this should be required.
libc/test/src/math/expm1f_test.cpp
111	If there is a possibility of trading off between performance and accuracy, we should provide build time switches for users to pick one or the other based on their needs. By default we want to "err" on the side of providing more accurate implementations over providing faster but less accurate implementations.
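A build-time switch of the kind requested above could look roughly like the following sketch. The macro `SKETCH_POLYEVAL_PREFER_SPEED` and the namespace `sketch` are hypothetical and are not part of LLVM libc; the point is only that the more accurate path is the default and the faster one is opt-in.

```cpp
#include <cmath>

namespace sketch {

// Default path: Horner's scheme.
inline float cubic_accurate(float x, float a0, float a1, float a2, float a3) {
  return std::fma(x, std::fma(x, std::fma(x, a3, a2), a1), a0);
}

// Opt-in path: independent partial sums combined through x^2.
inline float cubic_fast(float x, float a0, float a1, float a2, float a3) {
  float hi = std::fma(a3, x, a2);
  float lo = std::fma(a1, x, a0);
  return std::fma(hi, x * x, lo);
}

// Hypothetical switch: define SKETCH_POLYEVAL_PREFER_SPEED at build time
// (for example with -DSKETCH_POLYEVAL_PREFER_SPEED) to pick the fast path.
inline float cubic(float x, float a0, float a1, float a2, float a3) {
#ifdef SKETCH_POLYEVAL_PREFER_SPEED
  return cubic_fast(x, a0, a1, a2, a3);
#else
  return cubic_accurate(x, a0, a1, a2, a3);
#endif
}

} // namespace sketch
```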

sivachandra added inline comments. Dec 8 2021, 11:16 PM

libc/src/math/generic/CMakeLists.txt
503 ↗	(On Diff #392768)	So I learned that the fma-related builtins require this option. The correct way to do this would be to make a separate helper library of x86 builtin calls and list `-mfma` as an `INTERFACE` compile option for that library. Targets depending on the helper library will then automatically get the `-mfma` compile option.
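For comparison, Diff 393305 marks the individual functions with `__attribute__((target("fma")))`, which enables the FMA intrinsics for just those functions without a file-wide `-mfma`. The sketch below shows that pattern in isolation for GCC or Clang. The runtime check via `__builtin_cpu_supports` and the `std::fma` fallback are additions for illustration and are not in the patch; all names in the `sketch` namespace are hypothetical.

```cpp
#include <cmath>

#if defined(__x86_64__) && defined(__GNUC__)
#include <immintrin.h>
#define SKETCH_HAVE_X86_FMA_PATH 1
#endif

namespace sketch {

#ifdef SKETCH_HAVE_X86_FMA_PATH
// Only this function is compiled with the FMA extension enabled.
__attribute__((target("fma"))) inline float fma_intrinsic(float x, float y,
                                                          float z) {
  __m128 r = _mm_fmadd_ss(_mm_set_ss(x), _mm_set_ss(y), _mm_set_ss(z));
  return _mm_cvtss_f32(r);
}
#endif

// Assumption for this sketch: check the CPU at run time and fall back to
// std::fma when the FMA extension is unavailable or the target is not x86-64.
inline float fused_multiply_add(float x, float y, float z) {
#ifdef SKETCH_HAVE_X86_FMA_PATH
  if (__builtin_cpu_supports("fma"))
    return fma_intrinsic(x, y, z);
#endif
  return std::fma(x, y, z);
}

} // namespace sketch
```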

lntue marked 3 inline comments as done. Dec 9 2021, 7:26 AM

lntue added inline comments.

libc/test/src/math/expm1f_test.cpp
111	We're going to have a correctly rounded version for expm1f soon so I'm not worried about this regression yet.

[libc] Optimize PolyEval for x86-64 with degree 3-6 polynomials.

Harbormaster completed remote builds in B138451: Diff 393163. Dec 9 2021, 8:15 AM

[libc] Optimize PolyEval for x86-64 with degree 3-6 polynomials.

Harbormaster completed remote builds in B138494: Diff 393235. Dec 9 2021, 11:54 AM

[libc] Optimize PolyEval for x86-64 with degree 3-6 polynomials.

Are the subject line and description correct?

libc/src/__support/FPUtil/PolyEval.h
40	Can these be in just one file, say `PolyEvalSpecializations.h`?
libc/test/src/math/expm1f_test.cpp
111	Do we know why the accuracy dropped? If yes, can we add the reason to the commit description?

Harbormaster completed remote builds in B138516: Diff 393268. Dec 9 2021, 1:40 PM

[libc] Use intrinsic for x86-64 fma and optimize PolyEval for x86-64 with degree 3 & 5 polynomials.

lntue retitled this revision from "[libc] Optimize PolyEval for x86-64 with degree 3-6 polynomials." to "[libc] Use intrinsics for x86-64 fma and optimize PolyEval for x86-64 with degree 3 & 5 polynomials." Dec 9 2021, 2:49 PM

lntue edited the summary of this revision.

lntue marked an inline comment as done.

lntue added inline comments.

libc/test/src/math/expm1f_test.cpp
111	Added a possible reason to the patch's summary.

Harbormaster completed remote builds in B138532: Diff 393290. Dec 9 2021, 2:52 PM

[libc] Use intrinsics for x86-64 fma and optimize PolyEval for x86-64 with degree 3 & 5 polynomials.

Harbormaster completed remote builds in B138535: Diff 393293. Dec 9 2021, 2:59 PM

sivachandra accepted this revision. Dec 9 2021, 3:25 PM

This revision is now accepted and ready to land. Dec 9 2021, 3:25 PM

Closed by commit rGcdf6a581b927: [libc] Use intrinsics for x86-64 fma and optimize PolyEval for x86-64 with… (authored by lntue). Dec 9 2021, 3:34 PM

This revision was automatically updated to reflect the committed changes.

lntue added a commit: rGcdf6a581b927: [libc] Use intrinsics for x86-64 fma and optimize PolyEval for x86-64 with….

lntue mentioned this in D115408: [libc] Implement correctly rounded logf based on RLIBM library. Dec 9 2021, 4:08 PM

Diff 393305

libc/src/__support/FPUtil/PolyEval.h

[27 lines not shown]
 template <typename T> static inline T polyeval(T x, T a0) { return a0; }

 template <typename T, typename... Ts>
 static inline T polyeval(T x, T a0, Ts... a) {
   return fma(x, polyeval(x, a...), a0);
 }

 } // namespace fputil
 } // namespace __llvm_libc
    sivachandra (done): Couple of things: 1. We want individual header files to be self-contained. So, this include should be moved to the header files where it is required. 2. Because of the same self-contained headers requirement, the entities defined in a header file should be in the `__llvm_libc` namespace explicitly, as opposed to ending up in a namespace when included like this here. So, these includes should be moved outside of the namespace here.
+
+#ifdef LLVM_LIBC_ARCH_X86_64
+
+#include "x86_64/PolyEval.h"
    sivachandra (done): Can these be in just one file, say `PolyEvalSpecializations.h`?
+
+#endif // LLVM_LIBC_ARCH_X86_64

 #else

 namespace __llvm_libc {
 namespace fputil {

 template <typename T> static inline T polyeval(T x, T a0) { return a0; }

 template <typename T, typename... Ts>
[10 lines not shown]
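The generic `polyeval` above is a variadic form of Horner's scheme: coefficients are passed lowest degree first, and each recursion step peels off one coefficient. A standalone sketch of the same recursion, using `std::fma` in place of the libc-internal `fma` and a hypothetical `sketch` namespace, behaves as follows.

```cpp
#include <cmath>

namespace sketch {

// Base case: a degree-0 polynomial is just its constant term.
template <typename T> inline T polyeval(T, T a0) { return a0; }

// Recursive case: a0 + x * (a1 + x * (a2 + ...)).
template <typename T, typename... Ts>
inline T polyeval(T x, T a0, Ts... a) {
  return std::fma(x, polyeval(x, a...), a0);
}

} // namespace sketch
```

For example, `sketch::polyeval(2.0, 1.0, 2.0, 3.0)` evaluates 1 + 2x + 3x^2 at x = 2.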

libc/src/__support/FPUtil/x86_64/FMA.h

[10 lines not shown]

 #include "src/__support/architectures.h"

 #if !defined(LLVM_LIBC_ARCH_X86)
 #error "Invalid include"
 #endif

 #include "src/__support/CPP/TypeTraits.h"
+#include <immintrin.h>

 namespace __llvm_libc {
 namespace fputil {

 template <typename T>
-static inline cpp::EnableIfType<cpp::IsSame<T, float>::Value, T> fma(T x, T y,
-                                                                     T z) {
-  float result = x;
-  __asm__ __volatile__("vfmadd213ss %x2, %x1, %x0"
-                       : "+x"(result)
-                       : "x"(y), "x"(z));
+__attribute__((target(
+    "fma"))) static inline cpp::EnableIfType<cpp::IsSame<T, float>::Value, T>
+fma(T x, T y, T z) {
+  float result;
+  __m128 xmm = _mm_load_ss(&x);
+  __m128 ymm = _mm_load_ss(&y);
+  __m128 zmm = _mm_load_ss(&z);
+  __m128 r = _mm_fmadd_ss(xmm, ymm, zmm);
+  _mm_store_ss(&result, r);
   return result;
 }

 template <typename T>
-static inline cpp::EnableIfType<cpp::IsSame<T, double>::Value, T> fma(T x, T y,
-                                                                      T z) {
-  double result = x;
-  __asm__ __volatile__("vfmadd213sd %x2, %x1, %x0"
-                       : "+x"(result)
-                       : "x"(y), "x"(z));
+__attribute__((target(
+    "fma"))) static inline cpp::EnableIfType<cpp::IsSame<T, double>::Value, T>
+fma(T x, T y, T z) {
+  double result;
+  __m128d xmm = _mm_load_sd(&x);
+  __m128d ymm = _mm_load_sd(&y);
+  __m128d zmm = _mm_load_sd(&z);
+  __m128d r = _mm_fmadd_sd(xmm, ymm, zmm);
+  _mm_store_sd(&result, r);
   return result;
 }

 } // namespace fputil
 } // namespace __llvm_libc

 #endif // LLVM_LIBC_SRC_SUPPORT_FPUTIL_X86_64_FMA_H
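Both the old inline assembly and the new intrinsics compute x*y + z with a single rounding, which is the property that separates a fused multiply-add from a multiply followed by an add. A small portable sketch of that property, using `std::fma` and a hypothetical helper `product_error` that is not part of the patch:

```cpp
#include <cmath>

namespace sketch {

// Returns the rounding error of the product x*y.
// p is the rounded product; the FMA then computes the exact x*y minus p
// and rounds only that difference, so the lost low-order bits reappear.
inline double product_error(double x, double y) {
  double p = x * y;
  return std::fma(x, y, -p);
}

} // namespace sketch
```

With x = 1 + 2^-27, the exact square is 1 + 2^-26 + 2^-54, the rounded product drops the last term, and `product_error(x, x)` recovers it.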

libc/src/__support/FPUtil/x86_64/PolyEval.h

This file was added.

				//===-- Optimized PolyEval implementations for x86_64 --------- C++ -----*-===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//

				#ifndef LLVM_LIBC_SRC_SUPPORT_FPUTIL_X86_64_POLYEVAL_H
				#define LLVM_LIBC_SRC_SUPPORT_FPUTIL_X86_64_POLYEVAL_H

				#include "src/__support/architectures.h"

				#if !defined(LLVM_LIBC_ARCH_X86_64)
				#error "Invalid include"
				#endif

				#include <immintrin.h>

				namespace __llvm_libc {
				namespace fputil {

				// Cubic polynomials:
				// polyeval(x, a0, a1, a2, a3) = a3x^3 + a2x^2 + a1*x + a0
				template <>
				__attribute__((target("fma"))) inline float
				polyeval(float x, float a0, float a1, float a2, float a3) {
				__m128 xmm = _mm_set1_ps(x);
				__m128 a13 = _mm_set_ps(0.0f, x, a3, a1);
				__m128 a02 = _mm_set_ps(0.0f, 0.0f, a2, a0);
				// r = (0, x^2, a3x + a2, a1x + a0)
				__m128 r = _mm_fmadd_ps(a13, xmm, a02);
				// result = (a3x + a2) x^2 + (a1*x + a0)
				return fma(r[2], r[1], r[0]);
				}

				template <>
				__attribute__((target("fma"))) inline double
				polyeval(double x, double a0, double a1, double a2, double a3) {
				__m256d xmm = _mm256_set1_pd(x);
				__m256d a13 = _mm256_set_pd(0.0, x, a3, a1);
				__m256d a02 = _mm256_set_pd(0.0, 0.0, a2, a0);
				// r = (0, x^2, a3x + a2, a1x + a0)
				__m256d r = _mm256_fmadd_pd(a13, xmm, a02);
				// result = (a3x + a2) x^2 + (a1*x + a0)
				return fma(r[2], r[1], r[0]);
				}

				// Quintic polynomials:
				// polyeval(x, a0, a1, a2, a3, a4, a5) = a5x^5 + a4x^4 + a3x^3 + a2x^2
				// + a1*x + a0
				template <>
				__attribute__((target("fma"))) inline float
				polyeval(float x, float a0, float a1, float a2, float a3, float a4, float a5) {
				__m128 xmm = _mm_set1_ps(x);
				__m128 a25 = _mm_set_ps(0.0f, x, a5, a2);
				__m128 a14 = _mm_set_ps(0.0f, 0.0f, a4, a1);
				__m128 a03 = _mm_set_ps(0.0f, 0.0f, a3, a0);
				// r1 = (0, x^2, a5x + a4, a2x + a1)
				__m128 r1 = _mm_fmadd_ps(a25, xmm, a14);
				// r2 = (0, x^3, (a5x + a4)x + a3, (a2x + a1)x + a0)
				__m128 r2 = _mm_fmadd_ps(r1, xmm, a03);
				// result = ((a5x + a4)x + a3) * x^3 + ((a2x + a1)x + a0)
				return fma(r2[2], r2[1], r2[0]);
				}

				template <>
				__attribute__((target("fma"))) inline double
				polyeval(double x, double a0, double a1, double a2, double a3, double a4,
				double a5) {
				__m256d xmm = _mm256_set1_pd(x);
				__m256d a25 = _mm256_set_pd(0.0, x, a5, a2);
				__m256d a14 = _mm256_set_pd(0.0, 0.0, a4, a1);
				__m256d a03 = _mm256_set_pd(0.0, 0.0, a3, a0);
				// r1 = (0, x^2, a5x + a4, a2x + a1)
				__m256d r1 = _mm256_fmadd_pd(a25, xmm, a14);
				// r2 = (0, x^3, (a5x + a4)x + a3, (a2x + a1)x + a0)
				__m256d r2 = _mm256_fmadd_pd(r1, xmm, a03);
				// result = ((a5x + a4)x + a3) * x^3 + ((a2x + a1)x + a0)
				return fma(r2[2], r2[1], r2[0]);
				}

				} // namespace fputil
				} // namespace __llvm_libc

				#endif // LLVM_LIBC_SRC_SUPPORT_FPUTIL_X86_64_POLYEVAL_H
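The lane bookkeeping in the quintic specialization is easier to follow in a portable emulation. The sketch below replaces `__m256d` with a four-element array and `_mm256_fmadd_pd` with a lane-wise `std::fma` loop. The names `Lanes`, `fmadd`, and `quintic` are hypothetical. Note that `_mm256_set_pd` lists lanes from highest to lowest, so the arrays here are written in the reverse order of the intrinsic's arguments.

```cpp
#include <array>
#include <cmath>
#include <cstddef>

namespace sketch {

using Lanes = std::array<double, 4>; // index 0 is the lowest lane

// Lane-wise fused multiply-add, standing in for _mm256_fmadd_pd.
inline Lanes fmadd(const Lanes &a, const Lanes &b, const Lanes &c) {
  Lanes r{};
  for (std::size_t i = 0; i < r.size(); ++i)
    r[i] = std::fma(a[i], b[i], c[i]);
  return r;
}

inline double quintic(double x, double a0, double a1, double a2, double a3,
                      double a4, double a5) {
  const Lanes xs = {x, x, x, x};
  // Equivalent of _mm256_set_pd(0.0, x, a5, a2): a2 lands in lane 0.
  const Lanes a25 = {a2, a5, x, 0.0};
  const Lanes a14 = {a1, a4, 0.0, 0.0};
  const Lanes a03 = {a0, a3, 0.0, 0.0};
  // r1 = {a2x + a1, a5x + a4, x^2, 0}
  const Lanes r1 = fmadd(a25, xs, a14);
  // r2 = {(a2x + a1)x + a0, (a5x + a4)x + a3, x^3, 0}
  const Lanes r2 = fmadd(r1, xs, a03);
  // result = x^3 * ((a5x + a4)x + a3) + ((a2x + a1)x + a0)
  return std::fma(r2[2], r2[1], r2[0]);
}

} // namespace sketch
```

The third lane carries x, x^2, and then x^3 through the same two FMAs that build the partial sums, which is why no separate multiplication for the powers is needed.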

libc/test/src/math/expm1f_test.cpp

[102 lines not shown; context: for (uint32_t i = 0, v = 0; i <= count; ++i, v += step) {]
     float result = __llvm_libc::expm1f(x);

     // If the computation resulted in an error or did not produce valid result
     // in the single-precision floating point range, then ignore comparing with
     // MPFR result as MPFR can still produce valid results because of its
     // wider precision.
     if (isnan(result) || isinf(result) || errno != 0)
       continue;
-    ASSERT_MPFR_MATCH(mpfr::Operation::Expm1, x, __llvm_libc::expm1f(x), 1.5);
+    ASSERT_MPFR_MATCH(mpfr::Operation::Expm1, x, __llvm_libc::expm1f(x), 2.2);
    sivachandra (not done): If there is a possibility of trading off between performance and accuracy, we should provide build time switches for users to pick one or the other based on their needs. By default we want to "err" on the side of providing more accurate implementations over providing faster but less accurate implementations.
    lntue (done): We're going to have a correctly rounded version for expm1f soon so I'm not worried about this regression yet.
    sivachandra (not done): Do we know why the accuracy dropped? If yes, can we add the reason to the commit description?
    lntue (done): Added a possible reason to the patch's summary.
   }
 }
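The last argument of `ASSERT_MPFR_MATCH` in the test above is a tolerance, raised here from 1.5 to 2.2. As a rough illustration of measuring an error in units in the last place (ULP), the sketch below compares a `float` result with a `double` reference. It is not the definition used by the libc MPFR test utilities; the helpers `ulp_of` and `ulp_error` are hypothetical, and zero, subnormal, infinite, and NaN inputs are deliberately ignored.

```cpp
#include <cmath>
#include <limits>

namespace sketch {

// Spacing between adjacent floats at the magnitude of `reference`
// (normal, nonzero values only).
inline double ulp_of(double reference) {
  float f = static_cast<float>(reference);
  int exp = 0;
  std::frexp(f, &exp); // f = m * 2^exp with 0.5 <= |m| < 1
  return std::ldexp(1.0, exp - std::numeric_limits<float>::digits);
}

// Distance between `result` and `reference`, measured in float ULPs.
inline double ulp_error(float result, double reference) {
  return std::fabs(static_cast<double>(result) - reference) /
         ulp_of(reference);
}

} // namespace sketch
```

Under this measure, the float immediately above 1.0f is exactly one ULP away from a reference value of 1.0.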

This is an archive of the discontinued LLVM Phabricator instance.

[libc] Use intrinsics for x86-64 fma and optimize PolyEval for x86-64 with degree 3 & 5 polynomials.
ClosedPublic
