This is an archive of the discontinued LLVM Phabricator instance.

[X86] Move 128-bit f16c intrinsics to __emmintrin_f16c.h include from emmintrin.h. Move 256-bit f16c intrinsics back to f16cintrin.h
ClosedPublic

Authored by craig.topper on May 21 2018, 6:38 PM.


Details

Reviewers

RKSimon
spatel
echristo
DavidKreitzer
rnk

Commits

rG34c8c0d85882: [X86] Move 128-bit f16c intrinsics to __emmintrin_f16c.h include from emmintrin.
rC333014: [X86] Move 128-bit f16c intrinsics to __emmintrin_f16c.h include from emmintrin.
rL333014: [X86] Move 128-bit f16c intrinsics to __emmintrin_f16c.h include from emmintrin.

Summary

Intel documents the 128-bit versions as being in emmintrin.h and the 256-bit version as being in immintrin.h.

This patch adds a new __emmintrin_f16c.h to hold the 128-bit versions, which is included from emmintrin.h. The existing f16cintrin.h now contains only the 256-bit versions and is included from immintrin.h, with an error if it's included directly.

Diff Detail

Repository: rL LLVM

Event Timeline

craig.topper created this revision. May 21 2018, 6:38 PM

Harbormaster completed remote builds in B18431: Diff 147932. May 21 2018, 6:40 PM

craig.topper added inline comments. May 21 2018, 6:43 PM

lib/Headers/immintrin.h
72 ↗	(On Diff #147932)	One interesting thing to note here: the 256-bit f16c intrinsics were being guarded by AVX2 when _MSC_VER was defined and modules weren't supported. This was definitely incorrect.
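The guard described in this inline comment can be sketched in isolation. The snippet below is only an illustration, assuming the same three-way condition that immintrin.h uses in this diff; the macro and function names (`WOULD_INCLUDE_F16C`, `OLD_GUARD_EXPOSES_F16C_256`, `would_include_f16c`, and so on) are hypothetical and do not appear in the clang headers.

```c
/* Clang provides __has_feature; fall back to 0 on compilers that lack it
 * so the #if conditions below still parse. */
#ifndef __has_feature
#define __has_feature(x) 0
#endif

/* Same shape as the guard this revision places around <f16cintrin.h>:
 * keyed on __F16C__ rather than on an unrelated feature macro. */
#if !defined(_MSC_VER) || __has_feature(modules) || defined(__F16C__)
#define WOULD_INCLUDE_F16C 1
#else
#define WOULD_INCLUDE_F16C 0
#endif

/* Pre-patch situation: the 256-bit F16C intrinsics sat inside the AVX2
 * guard, so under _MSC_VER without modules they required __AVX2__. */
#if !defined(_MSC_VER) || __has_feature(modules) || defined(__AVX2__)
#define OLD_GUARD_EXPOSES_F16C_256 1
#else
#define OLD_GUARD_EXPOSES_F16C_256 0
#endif

/* Hypothetical helpers that expose the preprocessor results at run time. */
int is_msvc_compat(void) {
#if defined(_MSC_VER)
  return 1;
#else
  return 0;
#endif
}

int would_include_f16c(void) { return WOULD_INCLUDE_F16C; }
int old_guard_exposes_f16c_256(void) { return OLD_GUARD_EXPOSES_F16C_256; }
```

On a non-MSVC-compatible build both helpers return 1. The two guards only differ when _MSC_VER is defined and modules are off, which is the case the comment calls incorrect: a target with F16C but without AVX2 would not have seen the 256-bit intrinsics.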

craig.topper added a reviewer: rnk. May 22 2018, 11:04 AM

lgtm

This revision is now accepted and ready to land. May 22 2018, 11:18 AM

Aren't all the instructions from the same CPUID bit? It seems odd to split them across multiple files.

It is odd, but they really are split in the icc include files. So they got split a while back in clang to match the Intel Intrinsic Guide documentation.

In D47174#1108329, @craig.topper wrote:

It is odd, but they really are split in the icc include files. So they got split a while back in clang to match the Intel Intrinsic Guide documentation.

OK - if that means we're matching latest icc/gcc. LGTM.

Closed by commit rL333014: [X86] Move 128-bit f16c intrinsics to __emmintrin_f16c.h include from emmintrin. (authored by ctopper). May 22 2018, 11:59 AM

This revision was automatically updated to reflect the committed changes.

Herald added a subscriber: llvm-commits. May 22 2018, 11:59 AM

A bit of history: In icc, the f16<=>f32 conversion intrinsics are a bit of an anomaly in that they can be implemented using either native code or emulation code based on the target architecture switch. See https://godbolt.org/g/bQy7xY (thanks, Craig, for the example code). The emulation code lives in the Intel Math Library.

The reason icc chose to declare the scalar & 128-bit versions of the intrinsics in emmintrin.h rather than a header file that more closely corresponds to the f16c feature is that emmintrin.h contains the minimum necessary to use the emulation code, i.e. the declaration of the __m128i type.
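To make "emulation code" concrete, here is a minimal sketch of a software half-to-float conversion in portable C. It is not the Intel Math Library implementation, which is not shown in this review; the function name `emulated_cvtsh_ss` is hypothetical and only mirrors the signature of the `_cvtsh_ss` intrinsic. It assumes IEEE 754 binary16 input and a 32-bit IEEE 754 `float`.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: convert an IEEE 754 binary16 bit pattern to float
 * using integer operations only, i.e. without the VCVTPH2PS instruction. */
float emulated_cvtsh_ss(uint16_t h) {
  uint32_t sign = (uint32_t)(h & 0x8000u) << 16; /* move sign to bit 31 */
  uint32_t exp = (h >> 10) & 0x1Fu;              /* 5-bit exponent */
  uint32_t mant = h & 0x3FFu;                    /* 10-bit mantissa */
  uint32_t bits;
  float f;

  if (exp == 0) {
    if (mant == 0) {
      bits = sign; /* signed zero */
    } else {
      /* Subnormal half: shift the mantissa up until the implicit bit
       * appears, adjusting the float exponent for each shift. */
      uint32_t shift = 0;
      while ((mant & 0x400u) == 0) {
        mant <<= 1;
        shift++;
      }
      mant &= 0x3FFu; /* drop the now-implicit leading 1 */
      bits = sign | ((113u - shift) << 23) | (mant << 13);
    }
  } else if (exp == 0x1Fu) {
    bits = sign | 0x7F800000u | (mant << 13); /* infinity or NaN */
  } else {
    /* Normal number: rebias the exponent from 15 to 127. */
    bits = sign | ((exp + 112u) << 23) | (mant << 13);
  }

  memcpy(&f, &bits, sizeof f); /* reinterpret the bit pattern as float */
  return f;
}
```

For example, `emulated_cvtsh_ss(0x3C00)` yields 1.0f. Only integer types are needed here, which fits the point above that the emulation path requires very little from the header that declares it.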

Given that clang doesn't support the lowering of these intrinsics to emulation code, I don't see much benefit in including them in emmintrin.h. It would make more sense to just put everything in f16cintrin.h and include that from immintrin.h.

In brief, I like your changes in immintrin.h. I would move the code from __emmintrin_f16c.h into f16cintrin.h. And I would remove the include from emmintrin.h. I think that would be consistent with gcc as well. We can let the emulation behavior of these intrinsics remain an icc-specific anomaly.
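From a user's point of view, the suggested layout means the F16C intrinsics are reached through immintrin.h. The following is a small sketch of such user code, assuming a compiler that defines `__F16C__` when the feature is enabled (for example with `-mf16c`). The function name `half_roundtrip` is hypothetical, and the fallback branch is only a placeholder so the snippet still builds when F16C is not enabled.

```c
#if defined(__F16C__)
#include <immintrin.h> /* single entry point for the F16C intrinsics */

/* Convert a float to half precision and back using the 128-bit
 * intrinsics. Immediate 0 selects round-to-nearest (bits [2:0] = 000). */
float half_roundtrip(float x) {
  __m128 in = _mm_set_ss(x);          /* x in lane 0, zeros elsewhere */
  __m128i half = _mm_cvtps_ph(in, 0); /* 4 floats -> 4 halves (low 64 bits) */
  __m128 out = _mm_cvtph_ps(half);    /* 4 halves -> 4 floats */
  return _mm_cvtss_f32(out);          /* extract lane 0 */
}
#else
/* Placeholder when F16C is not enabled at compile time: returns the
 * input unchanged. This is NOT an emulation of the conversion. */
float half_roundtrip(float x) { return x; }
#endif
```

Values that are exactly representable in half precision, such as 1.0f or -2.0f, come back unchanged from `half_roundtrip` on either branch.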

Diffusion mentioned this in rL333033: [X86] As mentioned in post-commit feedback in D47174, move the 128 bit f16c…. May 22 2018, 3:23 PM

Diffusion mentioned this in rC333033: [X86] As mentioned in post-commit feedback in D47174, move the 128 bit f16c….

Implemented @DavidKreitzer's suggestion in r333033

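Revision Contents

Path	Size
cfe/trunk/lib/Headers/__emmintrin_f16c.h	124 lines
cfe/trunk/lib/Headers/emmintrin.h	2 lines
cfe/trunk/lib/Headers/f16cintrin.h	83 lines
cfe/trunk/lib/Headers/immintrin.h	50 lines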

Diff 148065

cfe/trunk/lib/Headers/__emmintrin_f16c.h

				/*===---- __emmintrin_f16c.h - F16C intrinsics -----------------------------===
				*
				* Permission is hereby granted, free of charge, to any person obtaining a copy
				* of this software and associated documentation files (the "Software"), to deal
				* in the Software without restriction, including without limitation the rights
				* to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
				* copies of the Software, and to permit persons to whom the Software is
				* furnished to do so, subject to the following conditions:
				*
				* The above copyright notice and this permission notice shall be included in
				* all copies or substantial portions of the Software.
				*
				* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
				* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
				* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
				* AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
				* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
				* OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
				* THE SOFTWARE.
				*
				*===-----------------------------------------------------------------------===
				*/

				#if !defined __EMMINTRIN_H
				#error "Never use <__emmintrin_f16c.h> directly; include <emmintrin.h> instead."
				#endif

				#ifndef __EMMINTRIN_F16C_H
				#define __EMMINTRIN_F16C_H

				/* Define the default attributes for the functions in this file. */
				#define __DEFAULT_FN_ATTRS \
				__attribute__((__always_inline__, __nodebug__, __target__("f16c")))

				/// Converts a 16-bit half-precision float value into a 32-bit float
				/// value.
				///
				/// \headerfile <x86intrin.h>
				///
				/// This intrinsic corresponds to the <c> VCVTPH2PS </c> instruction.
				///
				/// \param __a
				/// A 16-bit half-precision float value.
				/// \returns The converted 32-bit float value.
				static __inline float __DEFAULT_FN_ATTRS
				_cvtsh_ss(unsigned short __a)
				{
				__v8hi v = {(short)__a, 0, 0, 0, 0, 0, 0, 0};
				__v4sf r = __builtin_ia32_vcvtph2ps(v);
				return r[0];
				}

				/// Converts a 32-bit single-precision float value to a 16-bit
				/// half-precision float value.
				///
				/// \headerfile <x86intrin.h>
				///
				/// \code
				/// unsigned short _cvtss_sh(float a, const int imm);
				/// \endcode
				///
				/// This intrinsic corresponds to the <c> VCVTPS2PH </c> instruction.
				///
				/// \param a
				/// A 32-bit single-precision float value to be converted to a 16-bit
				/// half-precision float value.
				/// \param imm
				/// An immediate value controlling rounding using bits [2:0]: \n
				/// 000: Nearest \n
				/// 001: Down \n
				/// 010: Up \n
				/// 011: Truncate \n
				/// 1XX: Use MXCSR.RC for rounding
				/// \returns The converted 16-bit half-precision float value.
				#define _cvtss_sh(a, imm) __extension__ ({ \
				(unsigned short)(((__v8hi)__builtin_ia32_vcvtps2ph((__v4sf){a, 0, 0, 0}, \
				(imm)))[0]); })

				/// Converts a 128-bit vector containing 32-bit float values into a
				/// 128-bit vector containing 16-bit half-precision float values.
				///
				/// \headerfile <x86intrin.h>
				///
				/// \code
				/// __m128i _mm_cvtps_ph(__m128 a, const int imm);
				/// \endcode
				///
				/// This intrinsic corresponds to the <c> VCVTPS2PH </c> instruction.
				///
				/// \param a
				/// A 128-bit vector containing 32-bit float values.
				/// \param imm
				/// An immediate value controlling rounding using bits [2:0]: \n
				/// 000: Nearest \n
				/// 001: Down \n
				/// 010: Up \n
				/// 011: Truncate \n
				/// 1XX: Use MXCSR.RC for rounding
				/// \returns A 128-bit vector containing converted 16-bit half-precision float
				/// values. The lower 64 bits are used to store the converted 16-bit
				/// half-precision floating-point values.
				#define _mm_cvtps_ph(a, imm) __extension__ ({ \
				(__m128i)__builtin_ia32_vcvtps2ph((__v4sf)(__m128)(a), (imm)); })

				/// Converts a 128-bit vector containing 16-bit half-precision float
				/// values into a 128-bit vector containing 32-bit float values.
				///
				/// \headerfile <x86intrin.h>
				///
				/// This intrinsic corresponds to the <c> VCVTPH2PS </c> instruction.
				///
				/// \param __a
				/// A 128-bit vector containing 16-bit half-precision float values. The lower
				/// 64 bits are used in the conversion.
				/// \returns A 128-bit vector of [4 x float] containing converted float values.
				static __inline __m128 __DEFAULT_FN_ATTRS
				_mm_cvtph_ps(__m128i __a)
				{
				return (__m128)__builtin_ia32_vcvtph2ps((__v8hi)__a);
				}

				#undef __DEFAULT_FN_ATTRS

				#endif /* __EMMINTRIN_F16C_H */

cfe/trunk/lib/Headers/emmintrin.h

	typedef unsigned long long __v2du __attribute__ ((__vector_size__ (16)));			typedef unsigned long long __v2du __attribute__ ((__vector_size__ (16)));
	typedef unsigned short __v8hu __attribute__((__vector_size__(16)));			typedef unsigned short __v8hu __attribute__((__vector_size__(16)));
	typedef unsigned char __v16qu __attribute__((__vector_size__(16)));			typedef unsigned char __v16qu __attribute__((__vector_size__(16)));

	/* We need an explicitly signed variant for char. Note that this shouldn't			/* We need an explicitly signed variant for char. Note that this shouldn't
	* appear in the interface though. */			* appear in the interface though. */
	typedef signed char __v16qs __attribute__((__vector_size__(16)));			typedef signed char __v16qs __attribute__((__vector_size__(16)));

	#include <f16cintrin.h>			#include <__emmintrin_f16c.h>

	/* Define the default attributes for the functions in this file. */			/* Define the default attributes for the functions in this file. */
	#define __DEFAULT_FN_ATTRS __attribute__((__always_inline__, __nodebug__, __target__("sse2")))			#define __DEFAULT_FN_ATTRS __attribute__((__always_inline__, __nodebug__, __target__("sse2")))

	/// Adds lower double-precision values in both operands and returns the			/// Adds lower double-precision values in both operands and returns the
	/// sum in the lower 64 bits of the result. The upper 64 bits of the result			/// sum in the lower 64 bits of the result. The upper 64 bits of the result
	/// are copied from the upper double-precision value of the first operand.			/// are copied from the upper double-precision value of the first operand.
	///			///

cfe/trunk/lib/Headers/f16cintrin.h

	* AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER			* AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
	* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,			* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
	* OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN			* OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
	* THE SOFTWARE.			* THE SOFTWARE.
	*			*
	*===-----------------------------------------------------------------------===			*===-----------------------------------------------------------------------===
	*/			*/

	#if !defined __X86INTRIN_H && !defined __EMMINTRIN_H && !defined __IMMINTRIN_H			#if !defined __IMMINTRIN_H
	#error "Never use <f16cintrin.h> directly; include <emmintrin.h> instead."			#error "Never use <f16cintrin.h> directly; include <immintrin.h> instead."
	#endif			#endif

	#ifndef __F16CINTRIN_H			#ifndef __F16CINTRIN_H
	#define __F16CINTRIN_H			#define __F16CINTRIN_H

	/* Define the default attributes for the functions in this file. */			/* Define the default attributes for the functions in this file. */
	#define __DEFAULT_FN_ATTRS \			#define __DEFAULT_FN_ATTRS \
	__attribute__((__always_inline__, __nodebug__, __target__("f16c")))			__attribute__((__always_inline__, __nodebug__, __target__("f16c")))

	/// Converts a 16-bit half-precision float value into a 32-bit float			/* The 256-bit versions of functions in f16cintrin.h.
	/// value.			Intel documents these as being in immintrin.h, and
	///			they depend on typedefs from avxintrin.h. */
	/// \headerfile <x86intrin.h>
	///
	/// This intrinsic corresponds to the <c> VCVTPH2PS </c> instruction.
	///
	/// \param __a
	/// A 16-bit half-precision float value.
	/// \returns The converted 32-bit float value.
	static __inline float __DEFAULT_FN_ATTRS
	_cvtsh_ss(unsigned short __a)
	{
	__v8hi v = {(short)__a, 0, 0, 0, 0, 0, 0, 0};
	__v4sf r = __builtin_ia32_vcvtph2ps(v);
	return r[0];
	}

	/// Converts a 32-bit single-precision float value to a 16-bit
	/// half-precision float value.
	///
	/// \headerfile <x86intrin.h>
	///
	/// \code
	/// unsigned short _cvtss_sh(float a, const int imm);
	/// \endcode
	///
	/// This intrinsic corresponds to the <c> VCVTPS2PH </c> instruction.
	///
	/// \param a
	/// A 32-bit single-precision float value to be converted to a 16-bit
	/// half-precision float value.
	/// \param imm
	/// An immediate value controlling rounding using bits [2:0]: \n
	/// 000: Nearest \n
	/// 001: Down \n
	/// 010: Up \n
	/// 011: Truncate \n
	/// 1XX: Use MXCSR.RC for rounding
	/// \returns The converted 16-bit half-precision float value.
	#define _cvtss_sh(a, imm) __extension__ ({ \
	(unsigned short)(((__v8hi)__builtin_ia32_vcvtps2ph((__v4sf){a, 0, 0, 0}, \
	(imm)))[0]); })

	/// Converts a 128-bit vector containing 32-bit float values into a			/// Converts a 256-bit vector of [8 x float] into a 128-bit vector
	/// 128-bit vector containing 16-bit half-precision float values.			/// containing 16-bit half-precision float values.
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// \code			/// \code
	/// __m128i _mm_cvtps_ph(__m128 a, const int imm);			/// __m128i _mm256_cvtps_ph(__m256 a, const int imm);
	/// \endcode			/// \endcode
	///			///
	/// This intrinsic corresponds to the <c> VCVTPS2PH </c> instruction.			/// This intrinsic corresponds to the <c> VCVTPS2PH </c> instruction.
	///			///
	/// \param a			/// \param a
	/// A 128-bit vector containing 32-bit float values.			/// A 256-bit vector containing 32-bit single-precision float values to be
				/// converted to 16-bit half-precision float values.
	/// \param imm			/// \param imm
	/// An immediate value controlling rounding using bits [2:0]: \n			/// An immediate value controlling rounding using bits [2:0]: \n
	/// 000: Nearest \n			/// 000: Nearest \n
	/// 001: Down \n			/// 001: Down \n
	/// 010: Up \n			/// 010: Up \n
	/// 011: Truncate \n			/// 011: Truncate \n
	/// 1XX: Use MXCSR.RC for rounding			/// 1XX: Use MXCSR.RC for rounding
	/// \returns A 128-bit vector containing converted 16-bit half-precision float			/// \returns A 128-bit vector containing the converted 16-bit half-precision
	/// values. The lower 64 bits are used to store the converted 16-bit			/// float values.
	/// half-precision floating-point values.			#define _mm256_cvtps_ph(a, imm) __extension__ ({ \
	#define _mm_cvtps_ph(a, imm) __extension__ ({ \			(__m128i)__builtin_ia32_vcvtps2ph256((__v8sf)(__m256)(a), (imm)); })
	(__m128i)__builtin_ia32_vcvtps2ph((__v4sf)(__m128)(a), (imm)); })

	/// Converts a 128-bit vector containing 16-bit half-precision float			/// Converts a 128-bit vector containing 16-bit half-precision float
	/// values into a 128-bit vector containing 32-bit float values.			/// values into a 256-bit vector of [8 x float].
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> VCVTPH2PS </c> instruction.			/// This intrinsic corresponds to the <c> VCVTPH2PS </c> instruction.
	///			///
	/// \param __a			/// \param __a
	/// A 128-bit vector containing 16-bit half-precision float values. The lower			/// A 128-bit vector containing 16-bit half-precision float values to be
	/// 64 bits are used in the conversion.			/// converted to 32-bit single-precision float values.
	/// \returns A 128-bit vector of [4 x float] containing converted float values.			/// \returns A vector of [8 x float] containing the converted 32-bit
	static __inline __m128 __DEFAULT_FN_ATTRS			/// single-precision float values.
	_mm_cvtph_ps(__m128i __a)			static __inline __m256 __attribute__((__always_inline__, __nodebug__, __target__("f16c")))
				_mm256_cvtph_ps(__m128i __a)
	{			{
	return (__m128)__builtin_ia32_vcvtph2ps((__v8hi)__a);			return (__m256)__builtin_ia32_vcvtph2ps256((__v8hi)__a);
	}			}

	#undef __DEFAULT_FN_ATTRS			#undef __DEFAULT_FN_ATTRS

	#endif /* __F16CINTRIN_H */			#endif /* __F16CINTRIN_H */

cfe/trunk/lib/Headers/immintrin.h


	#if !defined(_MSC_VER) || __has_feature(modules) || defined(__AVX__)			#if !defined(_MSC_VER) || __has_feature(modules) || defined(__AVX__)
	#include <avxintrin.h>			#include <avxintrin.h>
	#endif			#endif

	#if !defined(_MSC_VER) || __has_feature(modules) || defined(__AVX2__)			#if !defined(_MSC_VER) || __has_feature(modules) || defined(__AVX2__)
	#include <avx2intrin.h>			#include <avx2intrin.h>

	/* The 256-bit versions of functions in f16cintrin.h.			#if !defined(_MSC_VER) || __has_feature(modules) || defined(__F16C__)
	Intel documents these as being in immintrin.h, and			#include <f16cintrin.h>
	they depend on typedefs from avxintrin.h. */

	/// Converts a 256-bit vector of [8 x float] into a 128-bit vector
	/// containing 16-bit half-precision float values.
	///
	/// \headerfile <x86intrin.h>
	///
	/// \code
	/// __m128i _mm256_cvtps_ph(__m256 a, const int imm);
	/// \endcode
	///
	/// This intrinsic corresponds to the <c> VCVTPS2PH </c> instruction.
	///
	/// \param a
	/// A 256-bit vector containing 32-bit single-precision float values to be
	/// converted to 16-bit half-precision float values.
	/// \param imm
	/// An immediate value controlling rounding using bits [2:0]: \n
	/// 000: Nearest \n
	/// 001: Down \n
	/// 010: Up \n
	/// 011: Truncate \n
	/// 1XX: Use MXCSR.RC for rounding
	/// \returns A 128-bit vector containing the converted 16-bit half-precision
	/// float values.
	#define _mm256_cvtps_ph(a, imm) __extension__ ({ \
	(__m128i)__builtin_ia32_vcvtps2ph256((__v8sf)(__m256)(a), (imm)); })

	/// Converts a 128-bit vector containing 16-bit half-precision float
	/// values into a 256-bit vector of [8 x float].
	///
	/// \headerfile <x86intrin.h>
	///
	/// This intrinsic corresponds to the <c> VCVTPH2PS </c> instruction.
	///
	/// \param __a
	/// A 128-bit vector containing 16-bit half-precision float values to be
	/// converted to 32-bit single-precision float values.
	/// \returns A vector of [8 x float] containing the converted 32-bit
	/// single-precision float values.
	static __inline __m256 __attribute__((__always_inline__, __nodebug__, __target__("f16c")))
	_mm256_cvtph_ps(__m128i __a)
	{
	return (__m256)__builtin_ia32_vcvtph2ps256((__v8hi)__a);
	}
	#endif /* __AVX2__ */

	#if !defined(_MSC_VER) || __has_feature(modules) || defined(__VPCLMULQDQ__)			#if !defined(_MSC_VER) || __has_feature(modules) || defined(__VPCLMULQDQ__)
	#include <vpclmulqdqintrin.h>			#include <vpclmulqdqintrin.h>
	#endif			#endif

	#if !defined(_MSC_VER) || __has_feature(modules) || defined(__BMI__)			#if !defined(_MSC_VER) || __has_feature(modules) || defined(__BMI__)
	#include <bmiintrin.h>			#include <bmiintrin.h>
	#endif			#endif