This is an archive of the discontinued LLVM Phabricator instance.

Differential D86855

Convert __m64 intrinsics to unconditionally use SSE2 instead of MMX instructions.
Needs Review · Public

Authored by jyknight on Aug 30 2020, 3:29 PM.

Details

Reviewers

craig.topper
spatel
RKSimon

Summary

Preliminary patch, posted to go along with discussion on llvm-dev.

3DNow! intrinsics are not converted, as of yet.

Tests have not been updated to match new expected IR output. Currently failing:
Clang :: CodeGen/attr-target-x86-mmx.c
Clang :: CodeGen/mmx-builtins.c
Clang :: CodeGen/mmx-shift-with-immediate.c
Clang :: Headers/xmmintrin.c

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

jyknight created this revision. Aug 30 2020, 3:29 PM

Herald added a project: Restricted Project. Aug 30 2020, 3:29 PM

Herald added subscribers: cfe-commits, danielkiss.

jyknight requested review of this revision. Aug 30 2020, 3:29 PM

Harbormaster completed remote builds in B70053: Diff 288876. Aug 30 2020, 4:02 PM

craig.topper added inline comments. Aug 30 2020, 10:44 PM

clang/lib/Headers/mmintrin.h
380	I think we should use __v8qu to match what we do in emmintrin.h. We don't currently set nsw on signed vector arithmetic, but we should be careful in case that changes in the future.
1136	I think we probably want to use a v2su or v2si here. Using v1di scalarizes and splits on 32-bit targets. On 64-bit targets it emits GPR code.
1281–1283	Need to use v8qs here to force "signed char" elements. v8qi uses "char", whose signedness is platform-dependent and can also be changed with a command-line option.
1305	Same here
1436	Is this needed?
1494	I don't think this change is needed. And I think the operands are in the wrong order.
clang/lib/Headers/tmmintrin.h
17–20	I'm worried that using v1di with the shuffles will lead to scalarization in the type legalizer. Should we use v2si instead?
clang/lib/Headers/xmmintrin.h
2320	This doesn't guarantee zeroes in bits 15:8 does it?
2411	Does this work with large pages?
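
The nsw concern raised for line 380 rests on a C rule: signed overflow is undefined, while unsigned arithmetic wraps modulo 2^N. The following is a minimal scalar sketch of that idea, not the header's code. The struct `v2x32` and the function `add_lanes_wrapping` are hypothetical names used only for illustration.

```c
#include <stdint.h>

/* Hypothetical stand-in for a 64-bit vector of two 32-bit lanes;
 * the real header uses compiler vector types instead. */
typedef struct { uint32_t lane[2]; } v2x32;

/* Lane-wise add done in unsigned arithmetic, so each lane wraps
 * modulo 2^32 and no signed overflow can occur. */
static v2x32 add_lanes_wrapping(v2x32 a, v2x32 b) {
    v2x32 r;
    for (int i = 0; i < 2; i++)
        r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}
```

Adding 1 to a lane holding 0x7FFFFFFF gives 0x80000000 in this model, and adding 1 to 0xFFFFFFFF gives 0, with no reliance on signed-overflow behavior.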

I've finally got back to moving this patch forward -- PTAL, thanks!

To start with, I wrote a simple test-suite to verify the functionality of these changes. I've included the tests I wrote under mmx-tests/ in this version of the patch -- although it doesn't actually belong there. I'm not exactly sure where it _does_ belong, however.

The test-suite runs a number of combinations of inputs against two different compilers' implementations of these intrinsics, and makes sure they produce identical results. I used this to ensure that there are no changes in behavior between old clang and clang after this change, and also to compare clang with GCC. Using that, I've fixed and verified all the bugs you noticed in code review already, as well as additional bugs the test-suite found (in _mm_maddubs_pi16 and _mm_shuffle_pi8). I'm now reasonably confident that this change will not alter the behavior of these functions. The tests also discovered two bugs in GCC: https://gcc.gnu.org/PR98495 and https://gcc.gnu.org/PR98522.
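
The comparison strategy described above is a form of differential testing. The sketch below shows the general shape only and is not the code under mmx-tests/. It assumes two hypothetical scalar implementations of a signed saturating 16-bit add (`adds16_reference` and `adds16_candidate`) and checks that they agree on every pair drawn from a small set of edge-case inputs.

```c
#include <stddef.h>
#include <stdint.h>

/* Reference: widen to int, add, clamp to the int16_t range.
 * Inputs and outputs are passed as raw 16-bit patterns. */
static uint16_t adds16_reference(uint16_t ua, uint16_t ub) {
    int a = ua < 0x8000 ? (int)ua : (int)ua - 0x10000;
    int b = ub < 0x8000 ? (int)ub : (int)ub - 0x10000;
    int sum = a + b;
    if (sum > 32767) sum = 32767;
    if (sum < -32768) sum = -32768;
    return (uint16_t)sum;
}

/* Candidate: wrapping add plus bitwise overflow detection. */
static uint16_t adds16_candidate(uint16_t ua, uint16_t ub) {
    uint16_t r = (uint16_t)(ua + ub);
    /* Overflow iff the result's sign differs from both inputs' signs. */
    if (((ua ^ r) & (ub ^ r)) & 0x8000)
        r = (uint16_t)((ua >> 15) + 0x7FFF);
    return r;
}

/* Run both implementations over the cross product of edge-case
 * inputs and return the number of pairs on which they disagree. */
static int count_mismatches(void) {
    static const uint16_t edge[] = {
        0x0000, 0x0001, 0x0064, 0x7FFE, 0x7FFF,
        0x8000, 0x8001, 0xFF9C, 0xFFFE, 0xFFFF
    };
    const size_t n = sizeof edge / sizeof edge[0];
    int mismatches = 0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            if (adds16_reference(edge[i], edge[j]) !=
                adds16_candidate(edge[i], edge[j]))
                mismatches++;
    return mismatches;
}
```

A return value of 0 from `count_mismatches` means the two implementations agree on every tested pair. The real suite compares compilers rather than two functions, but the loop structure is the same idea.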

Some other changes in this update:

I switched _mm_extract_pi16 and _mm_insert_pi16 back to using a clang intrinsic, for consistency with the other extract/insert macros, which use an intrinsic function simply to force the element number to be a compile-time constant and to produce an error when it's not. But the intrinsic now lowers to generic IR like all the other __builtin_ia32_vec_{ext,set}_*, rather than to an LLVM intrinsic forcing MMX.

I modified the "composite" functions in xmmintrin.h to directly use 128-bit operations, instead of composites of multiple 64-bit operations, where possible.

Finally, the clang tests have been updated, so that all tests pass again.
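
The overall pattern in this patch is to widen a 64-bit operand to 128 bits, apply a 128-bit operation, and truncate the result back to 64 bits. In the emmintrin.h diff, `_mm_mul_su32` is written as `__trunc64` applied to a 128-bit `pmuludq` over `__anyext128` operands. The sketch below is a scalar model of that data flow, under the assumption that a 128-bit register can be treated as two 64-bit lanes. Every name ending in `_model`, and the `filler` parameter, is hypothetical.

```c
#include <stdint.h>

/* Hypothetical stand-in for a 128-bit register: two 64-bit lanes. */
typedef struct { uint64_t lane[2]; } vec128_model;

/* Model of __anyext128: the upper lane is unspecified, so the caller
 * passes arbitrary filler to make that explicit. */
static vec128_model anyext128_model(uint64_t x, uint64_t filler) {
    vec128_model v = {{x, filler}};
    return v;
}

/* Model of __trunc64: keep only the low 64 bits. */
static uint64_t trunc64_model(vec128_model v) {
    return v.lane[0];
}

/* Model of a PMULUDQ-style 128-bit operation: in each 64-bit lane,
 * multiply the low 32 bits of the operands into a 64-bit product. */
static vec128_model mul_epu32_model(vec128_model a, vec128_model b) {
    vec128_model r;
    for (int i = 0; i < 2; i++)
        r.lane[i] = (a.lane[i] & 0xFFFFFFFFu) * (b.lane[i] & 0xFFFFFFFFu);
    return r;
}

/* Widen, operate at 128 bits, truncate. */
static uint64_t mul_su32_model(uint64_t a, uint64_t b, uint64_t filler) {
    return trunc64_model(mul_epu32_model(anyext128_model(a, filler),
                                         anyext128_model(b, filler)));
}
```

Because only lane 0 survives the truncation, the result does not depend on `filler`. That is why an "any-extend" is enough for this operation, whereas an operation that mixes lanes would need a zero-extend instead.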

clang/lib/Headers/mmintrin.h
380	Done, here and everywhere else I was using signed math (except the comparisons).
1136	AFAICT, this doesn't matter? It seems to emit GPR or XMM code just depending on whether the result values are needed as XMM or not, independent of whether the type is specified as v2su or v1du.
1281–1283	Done.
1305	This is a short, which is always signed, so it should be ok as written.
1436	No, reverted this change and the others like it.
1494	Change was unnecessary, so reverted. (But operands are supposed to be backwards here.)
clang/lib/Headers/tmmintrin.h
17–20	Converting `__trunc64` to v4si (and thus v2si return value) seems to make codegen _worse_ in some cases, and I don't see any case where it gets better. For example:

    #define __trunc64_1(x) (__m64)__builtin_shufflevector((__v2di)(x), __extension__ (__v2di){}, 0)
    #define __trunc64_2(x) (__m64)__builtin_shufflevector((__v4si)(x), __extension__ (__v4si){}, 0, 1)
    __m64 trunc1(__m128 a, int i) { return __trunc64_1(__builtin_ia32_psllqi128(a, i)); }
    __m64 trunc2(__m128 a, int i) { return __trunc64_2(__builtin_ia32_psllqi128(a, i)); }

In trunc2, you get two extraneous moves at the end:

    movd %edi, %xmm1
    psllq %xmm1, %xmm0
    movq %xmm0, %rax // extra
    movq %rax, %xmm0 // extra

I guess that's related to calling-convention lowering, which turns m64 into "double", confusing the various IR simplifications?

Similarly, there are also extraneous moves to/from a GPR for argument passing sometimes, but I don't see an easy way around that. Both variants do that here, instead of just `movq %xmm0, %xmm0`:

    #define __anyext128_1(x) (__m128i)__builtin_shufflevector((__v1di)(x), __extension__ (__v1di){}, 0, -1)
    #define __anyext128_2(x) (__m128i)__builtin_shufflevector((__v2si)(x), __extension__ (__v2si){}, 0, 1, -1, -1)
    #define __zext128_1(x) (__m128i)__builtin_shufflevector((__v1di)(x), __extension__ (__v1di){}, 0, 1)
    #define __zext128_2(x) (__m128i)__builtin_shufflevector((__v2si)(x), __extension__ (__v2si){}, 0, 1, 2, 3)
    __m128 ext1(__m64 a) { return __builtin_convertvector((__v4si)__zext128_1(a), __v4sf); }
    __m128 ext2(__m64 a) { return __builtin_convertvector((__v4si)__zext128_2(a), __v4sf); }

Both produce:

    movq %xmm0, %rax
    movq %rax, %xmm0
    cvtdq2ps %xmm0, %xmm0
    retq

However, switching to variant 2 of `anyext128` and `zext128` does seem to improve things in other cases, avoiding _some_ of those sorts of extraneous moves to a scalar register and back again. So I've made that change.
clang/lib/Headers/xmmintrin.h
2320	It does not. Switched to zext128.
2411	Yes -- this needs to be the boundary at which a trap _might_ occur if we crossed it. Whether it's in fact the end of a page or not is irrelevant, only that it _could_ be.
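
The reasoning in the reply for line 2411 can be sketched as a conservative boundary check. This is not the header's code. The 4096-byte boundary is an assumption made here for illustration, and `MODEL_PAGE_SIZE` and `may_cross_page_boundary` are hypothetical names. With larger pages, a 4096-aligned address may fall in the middle of a page, but treating it as a possible fault boundary stays safe.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption for this sketch: 4096 bytes is the smallest distance
 * between addresses at which a fault could begin. */
#define MODEL_PAGE_SIZE 4096u

/* Returns true when an access of `width` bytes starting at `addr`
 * would extend past the next MODEL_PAGE_SIZE-aligned boundary. */
static bool may_cross_page_boundary(uintptr_t addr, unsigned width) {
    uintptr_t offset = addr & (MODEL_PAGE_SIZE - 1u);
    return offset + width > MODEL_PAGE_SIZE;
}
```

An access that returns false here cannot touch a second page no matter what the actual page size is, which matches the point that only the possibility of a boundary matters.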

Herald added a subscriber: pengfei. Jan 6 2021, 8:30 PM

Fix and test.

Herald added a project: Restricted Project. Jan 6 2021, 8:30 PM

Herald added a subscriber: llvm-commits.

jyknight mentioned this in D94213: Clang: Remove support for 3DNow!, both intrinsics and builtins. Jan 6 2021, 8:37 PM

craig.topper added inline comments. Jan 6 2021, 11:58 PM

clang/lib/Headers/mmintrin.h
1305	Yeah. I don't know why I wrote that now.
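
The signedness point discussed for lines 1281–1283 and 1305 can be illustrated with a scalar model of a byte compare. In C, `short` is always signed, but whether plain `char` is signed is implementation-defined and can be changed by compiler options such as `-funsigned-char`. The helper names below are hypothetical and the code is only a sketch of why the element type matters.

```c
#include <limits.h>
#include <stdint.h>

/* Interpret a byte's bit pattern as a two's-complement signed value. */
static int as_signed8(uint8_t bits) {
    return bits < 0x80 ? (int)bits : (int)bits - 0x100;
}

/* Compare-greater on one byte lane, signed view: 0xFF if a > b. */
static uint8_t cmpgt_signed8(uint8_t a, uint8_t b) {
    return as_signed8(a) > as_signed8(b) ? 0xFF : 0x00;
}

/* The same comparison with the bytes viewed as unsigned. */
static uint8_t cmpgt_unsigned8(uint8_t a, uint8_t b) {
    return a > b ? 0xFF : 0x00;
}

/* Nonzero when plain char is signed for this target and these flags. */
static int plain_char_is_signed(void) {
    return CHAR_MIN < 0;
}
```

For the bit patterns 0x80 and 0x01, the signed view says 0x80 (-128) is smaller, while the unsigned view says it is larger. A vector typed with plain `char` could silently pick either answer, which is the reason for using an explicitly signed element type.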

jyknight mentioned this in D94252: Delete (most) of the MMX builtin functions from Clang. Jan 7 2021, 11:40 AM

Ping.

Ping, thanks!

Or, if you have suggestions on how to make it easier to review, I'd be open to that.

mr-c added a subscriber: mr-c. May 27 2023, 3:26 AM

Herald added a project: Restricted Project. May 27 2023, 3:26 AM

Herald added a subscriber: StephenFan.

Reverse ping. Any progress or plan for this patch?

Revision Contents

Path	Size
clang/include/clang/Basic/BuiltinsX86.def	4 lines
clang/lib/CodeGen/CGBuiltin.cpp	2 lines
clang/lib/Headers/emmintrin.h	33 lines
clang/lib/Headers/mmintrin.h	311 lines
clang/lib/Headers/tmmintrin.h	90 lines
clang/lib/Headers/xmmintrin.h	186 lines
clang/test/CodeGen/X86/mmx-builtins.c	207 lines
clang/test/CodeGen/X86/mmx-shift-with-immediate.c	16 lines
clang/test/CodeGen/attr-target-x86-mmx.c	7 lines
clang/test/Headers/xmmintrin.c	2 lines
clang/test/Sema/x86-builtin-palignr.c	2 lines
llvm/include/llvm/IR/IntrinsicsX86.td	4 lines
mmx-tests/Makefile	29 lines
mmx-tests/mmx-tests.py	301 lines
mmx-tests/test.c	237 lines

Diff 315039

clang/include/clang/Basic/BuiltinsX86.def

	[...]
	TARGET_BUILTIN(__builtin_ia32_pmaxsw, "V4sV4sV4s", "ncV:64:", "mmx,sse")			TARGET_BUILTIN(__builtin_ia32_pmaxsw, "V4sV4sV4s", "ncV:64:", "mmx,sse")
	TARGET_BUILTIN(__builtin_ia32_pmaxub, "V8cV8cV8c", "ncV:64:", "mmx,sse")			TARGET_BUILTIN(__builtin_ia32_pmaxub, "V8cV8cV8c", "ncV:64:", "mmx,sse")
	TARGET_BUILTIN(__builtin_ia32_pminsw, "V4sV4sV4s", "ncV:64:", "mmx,sse")			TARGET_BUILTIN(__builtin_ia32_pminsw, "V4sV4sV4s", "ncV:64:", "mmx,sse")
	TARGET_BUILTIN(__builtin_ia32_pminub, "V8cV8cV8c", "ncV:64:", "mmx,sse")			TARGET_BUILTIN(__builtin_ia32_pminub, "V8cV8cV8c", "ncV:64:", "mmx,sse")
	TARGET_BUILTIN(__builtin_ia32_pmovmskb, "iV8c", "ncV:64:", "mmx,sse")			TARGET_BUILTIN(__builtin_ia32_pmovmskb, "iV8c", "ncV:64:", "mmx,sse")
	TARGET_BUILTIN(__builtin_ia32_pmulhuw, "V4sV4sV4s", "ncV:64:", "mmx,sse")			TARGET_BUILTIN(__builtin_ia32_pmulhuw, "V4sV4sV4s", "ncV:64:", "mmx,sse")
	TARGET_BUILTIN(__builtin_ia32_psadbw, "V4sV8cV8c", "ncV:64:", "mmx,sse")			TARGET_BUILTIN(__builtin_ia32_psadbw, "V4sV8cV8c", "ncV:64:", "mmx,sse")
	TARGET_BUILTIN(__builtin_ia32_pshufw, "V4sV4sIc", "ncV:64:", "mmx,sse")			TARGET_BUILTIN(__builtin_ia32_pshufw, "V4sV4sIc", "ncV:64:", "mmx,sse")
	TARGET_BUILTIN(__builtin_ia32_vec_ext_v4hi, "iV4sIi", "ncV:64:", "mmx,sse")			TARGET_BUILTIN(__builtin_ia32_vec_ext_v4hi, "sV4sIi", "ncV:64:", "sse")
	TARGET_BUILTIN(__builtin_ia32_vec_set_v4hi, "V4sV4siIi", "ncV:64:", "mmx,sse")			TARGET_BUILTIN(__builtin_ia32_vec_set_v4hi, "V4sV4ssIi", "ncV:64:", "sse")

	// MMX+SSE2			// MMX+SSE2
	TARGET_BUILTIN(__builtin_ia32_cvtpd2pi, "V2iV2d", "ncV:64:", "mmx,sse2")			TARGET_BUILTIN(__builtin_ia32_cvtpd2pi, "V2iV2d", "ncV:64:", "mmx,sse2")
	TARGET_BUILTIN(__builtin_ia32_cvtpi2pd, "V2dV2i", "ncV:64:", "mmx,sse2")			TARGET_BUILTIN(__builtin_ia32_cvtpi2pd, "V2dV2i", "ncV:64:", "mmx,sse2")
	TARGET_BUILTIN(__builtin_ia32_cvttpd2pi, "V2iV2d", "ncV:64:", "mmx,sse2")			TARGET_BUILTIN(__builtin_ia32_cvttpd2pi, "V2iV2d", "ncV:64:", "mmx,sse2")
	TARGET_BUILTIN(__builtin_ia32_paddq, "V1OiV1OiV1Oi", "ncV:64:", "mmx,sse2")			TARGET_BUILTIN(__builtin_ia32_paddq, "V1OiV1OiV1Oi", "ncV:64:", "mmx,sse2")
	TARGET_BUILTIN(__builtin_ia32_pmuludq, "V1OiV2iV2i", "ncV:64:", "mmx,sse2")			TARGET_BUILTIN(__builtin_ia32_pmuludq, "V1OiV2iV2i", "ncV:64:", "mmx,sse2")
	TARGET_BUILTIN(__builtin_ia32_psubq, "V1OiV1OiV1Oi", "ncV:64:", "mmx,sse2")			TARGET_BUILTIN(__builtin_ia32_psubq, "V1OiV1OiV1Oi", "ncV:64:", "mmx,sse2")
	[...]

clang/lib/CodeGen/CGBuiltin.cpp

[...]
case X86::BI__builtin_ia32_undef512:
// TODO: If we had a "freeze" IR instruction to generate a fixed undef		// TODO: If we had a "freeze" IR instruction to generate a fixed undef
// value, we should use that here instead of a zero.		// value, we should use that here instead of a zero.
return llvm::Constant::getNullValue(ConvertType(E->getType()));		return llvm::Constant::getNullValue(ConvertType(E->getType()));
case X86::BI__builtin_ia32_vec_init_v8qi:		case X86::BI__builtin_ia32_vec_init_v8qi:
case X86::BI__builtin_ia32_vec_init_v4hi:		case X86::BI__builtin_ia32_vec_init_v4hi:
case X86::BI__builtin_ia32_vec_init_v2si:		case X86::BI__builtin_ia32_vec_init_v2si:
return Builder.CreateBitCast(BuildVector(Ops),		return Builder.CreateBitCast(BuildVector(Ops),
llvm::Type::getX86_MMXTy(getLLVMContext()));		llvm::Type::getX86_MMXTy(getLLVMContext()));
		case X86::BI__builtin_ia32_vec_ext_v4hi:
case X86::BI__builtin_ia32_vec_ext_v2si:		case X86::BI__builtin_ia32_vec_ext_v2si:
case X86::BI__builtin_ia32_vec_ext_v16qi:		case X86::BI__builtin_ia32_vec_ext_v16qi:
case X86::BI__builtin_ia32_vec_ext_v8hi:		case X86::BI__builtin_ia32_vec_ext_v8hi:
case X86::BI__builtin_ia32_vec_ext_v4si:		case X86::BI__builtin_ia32_vec_ext_v4si:
case X86::BI__builtin_ia32_vec_ext_v4sf:		case X86::BI__builtin_ia32_vec_ext_v4sf:
case X86::BI__builtin_ia32_vec_ext_v2di:		case X86::BI__builtin_ia32_vec_ext_v2di:
case X86::BI__builtin_ia32_vec_ext_v32qi:		case X86::BI__builtin_ia32_vec_ext_v32qi:
case X86::BI__builtin_ia32_vec_ext_v16hi:		case X86::BI__builtin_ia32_vec_ext_v16hi:
case X86::BI__builtin_ia32_vec_ext_v8si:		case X86::BI__builtin_ia32_vec_ext_v8si:
case X86::BI__builtin_ia32_vec_ext_v4di: {		case X86::BI__builtin_ia32_vec_ext_v4di: {
unsigned NumElts =		unsigned NumElts =
cast<llvm::FixedVectorType>(Ops[0]->getType())->getNumElements();		cast<llvm::FixedVectorType>(Ops[0]->getType())->getNumElements();
uint64_t Index = cast<ConstantInt>(Ops[1])->getZExtValue();		uint64_t Index = cast<ConstantInt>(Ops[1])->getZExtValue();
Index &= NumElts - 1;		Index &= NumElts - 1;
// These builtins exist so we can ensure the index is an ICE and in range.		// These builtins exist so we can ensure the index is an ICE and in range.
// Otherwise we could just do this in the header file.		// Otherwise we could just do this in the header file.
return Builder.CreateExtractElement(Ops[0], Index);		return Builder.CreateExtractElement(Ops[0], Index);
}		}
		case X86::BI__builtin_ia32_vec_set_v4hi:
case X86::BI__builtin_ia32_vec_set_v16qi:		case X86::BI__builtin_ia32_vec_set_v16qi:
case X86::BI__builtin_ia32_vec_set_v8hi:		case X86::BI__builtin_ia32_vec_set_v8hi:
case X86::BI__builtin_ia32_vec_set_v4si:		case X86::BI__builtin_ia32_vec_set_v4si:
case X86::BI__builtin_ia32_vec_set_v2di:		case X86::BI__builtin_ia32_vec_set_v2di:
case X86::BI__builtin_ia32_vec_set_v32qi:		case X86::BI__builtin_ia32_vec_set_v32qi:
case X86::BI__builtin_ia32_vec_set_v16hi:		case X86::BI__builtin_ia32_vec_set_v16hi:
case X86::BI__builtin_ia32_vec_set_v8si:		case X86::BI__builtin_ia32_vec_set_v8si:
case X86::BI__builtin_ia32_vec_set_v4di: {		case X86::BI__builtin_ia32_vec_set_v4di: {
[...]

clang/lib/Headers/emmintrin.h

	[...]
	typedef unsigned char __v16qu __attribute__((__vector_size__(16)));			typedef unsigned char __v16qu __attribute__((__vector_size__(16)));

	/* We need an explicitly signed variant for char. Note that this shouldn't			/* We need an explicitly signed variant for char. Note that this shouldn't
	* appear in the interface though. */			* appear in the interface though. */
	typedef signed char __v16qs __attribute__((__vector_size__(16)));			typedef signed char __v16qs __attribute__((__vector_size__(16)));

	/* Define the default attributes for the functions in this file. */			/* Define the default attributes for the functions in this file. */
	#define __DEFAULT_FN_ATTRS __attribute__((__always_inline__, __nodebug__, __target__("sse2"), __min_vector_width__(128)))			#define __DEFAULT_FN_ATTRS __attribute__((__always_inline__, __nodebug__, __target__("sse2"), __min_vector_width__(128)))
	#define __DEFAULT_FN_ATTRS_MMX __attribute__((__always_inline__, __nodebug__, __target__("mmx,sse2"), __min_vector_width__(64)))
				#define __trunc64(x) (__m64)__builtin_shufflevector((__v2di)(x), __extension__ (__v2di){}, 0)
				#define __anyext128(x) (__m128i)__builtin_shufflevector((__v2si)(x), __extension__ (__v2si){}, 0, 1, -1, -1)

	/// Adds lower double-precision values in both operands and returns the			/// Adds lower double-precision values in both operands and returns the
	/// sum in the lower 64 bits of the result. The upper 64 bits of the result			/// sum in the lower 64 bits of the result. The upper 64 bits of the result
	/// are copied from the upper double-precision value of the first operand.			/// are copied from the upper double-precision value of the first operand.
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> VADDSD / ADDSD </c> instruction.			/// This intrinsic corresponds to the <c> VADDSD / ADDSD </c> instruction.
	[...]
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> CVTPD2PI </c> instruction.			/// This intrinsic corresponds to the <c> CVTPD2PI </c> instruction.
	///			///
	/// \param __a			/// \param __a
	/// A 128-bit vector of [2 x double].			/// A 128-bit vector of [2 x double].
	/// \returns A 64-bit vector of [2 x i32] containing the converted values.			/// \returns A 64-bit vector of [2 x i32] containing the converted values.
	static __inline__ __m64 __DEFAULT_FN_ATTRS_MMX			static __inline__ __m64 __DEFAULT_FN_ATTRS
	_mm_cvtpd_pi32(__m128d __a)			_mm_cvtpd_pi32(__m128d __a)
	{			{
	return (__m64)__builtin_ia32_cvtpd2pi((__v2df)__a);			return __trunc64(__builtin_ia32_cvtpd2dq((__v2df)__a));
	}			}

	/// Converts the two double-precision floating-point elements of a			/// Converts the two double-precision floating-point elements of a
	/// 128-bit vector of [2 x double] into two signed 32-bit integer values,			/// 128-bit vector of [2 x double] into two signed 32-bit integer values,
	/// returned in a 64-bit vector of [2 x i32].			/// returned in a 64-bit vector of [2 x i32].
	///			///
	/// If the result of either conversion is inexact, the result is truncated			/// If the result of either conversion is inexact, the result is truncated
	/// (rounded towards zero) regardless of the current MXCSR setting.			/// (rounded towards zero) regardless of the current MXCSR setting.
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> CVTTPD2PI </c> instruction.			/// This intrinsic corresponds to the <c> CVTTPD2PI </c> instruction.
	///			///
	/// \param __a			/// \param __a
	/// A 128-bit vector of [2 x double].			/// A 128-bit vector of [2 x double].
	/// \returns A 64-bit vector of [2 x i32] containing the converted values.			/// \returns A 64-bit vector of [2 x i32] containing the converted values.
	static __inline__ __m64 __DEFAULT_FN_ATTRS_MMX			static __inline__ __m64 __DEFAULT_FN_ATTRS
	_mm_cvttpd_pi32(__m128d __a)			_mm_cvttpd_pi32(__m128d __a)
	{			{
	return (__m64)__builtin_ia32_cvttpd2pi((__v2df)__a);			return __trunc64(__builtin_ia32_cvttpd2dq((__v2df)__a));
	}			}

	/// Converts the two signed 32-bit integer elements of a 64-bit vector of			/// Converts the two signed 32-bit integer elements of a 64-bit vector of
	/// [2 x i32] into two double-precision floating-point values, returned in a			/// [2 x i32] into two double-precision floating-point values, returned in a
	/// 128-bit vector of [2 x double].			/// 128-bit vector of [2 x double].
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> CVTPI2PD </c> instruction.			/// This intrinsic corresponds to the <c> CVTPI2PD </c> instruction.
	///			///
	/// \param __a			/// \param __a
	/// A 64-bit vector of [2 x i32].			/// A 64-bit vector of [2 x i32].
	/// \returns A 128-bit vector of [2 x double] containing the converted values.			/// \returns A 128-bit vector of [2 x double] containing the converted values.
	static __inline__ __m128d __DEFAULT_FN_ATTRS_MMX			static __inline__ __m128d __DEFAULT_FN_ATTRS
	_mm_cvtpi32_pd(__m64 __a)			_mm_cvtpi32_pd(__m64 __a)
	{			{
	return __builtin_ia32_cvtpi2pd((__v2si)__a);			return (__m128d) __builtin_convertvector((__v2si)__a, __v2df);
	}			}

	/// Returns the low-order element of a 128-bit vector of [2 x double] as			/// Returns the low-order element of a 128-bit vector of [2 x double] as
	/// a double-precision floating-point value.			/// a double-precision floating-point value.
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic has no corresponding instruction.			/// This intrinsic has no corresponding instruction.
	[...]
	///			///
	/// This intrinsic corresponds to the <c> PADDQ </c> instruction.			/// This intrinsic corresponds to the <c> PADDQ </c> instruction.
	///			///
	/// \param __a			/// \param __a
	/// A 64-bit integer.			/// A 64-bit integer.
	/// \param __b			/// \param __b
	/// A 64-bit integer.			/// A 64-bit integer.
	/// \returns A 64-bit integer containing the sum of both parameters.			/// \returns A 64-bit integer containing the sum of both parameters.
	static __inline__ __m64 __DEFAULT_FN_ATTRS_MMX			static __inline__ __m64 __DEFAULT_FN_ATTRS
	_mm_add_si64(__m64 __a, __m64 __b)			_mm_add_si64(__m64 __a, __m64 __b)
	{			{
	return (__m64)__builtin_ia32_paddq((__v1di)__a, (__v1di)__b);			return (__m64)(((unsigned long long)__a) + ((unsigned long long)__b));
	}			}

	/// Adds the corresponding elements of two 128-bit vectors of [2 x i64],			/// Adds the corresponding elements of two 128-bit vectors of [2 x i64],
	/// saving the lower 64 bits of each sum in the corresponding element of a			/// saving the lower 64 bits of each sum in the corresponding element of a
	/// 128-bit result vector of [2 x i64].			/// 128-bit result vector of [2 x i64].
	///			///
	/// The integer elements of both parameters can be either signed or unsigned.			/// The integer elements of both parameters can be either signed or unsigned.
	///			///
	[...]
	///			///
	/// This intrinsic corresponds to the <c> PMULUDQ </c> instruction.			/// This intrinsic corresponds to the <c> PMULUDQ </c> instruction.
	///			///
	/// \param __a			/// \param __a
	/// A 64-bit integer containing one of the source operands.			/// A 64-bit integer containing one of the source operands.
	/// \param __b			/// \param __b
	/// A 64-bit integer containing one of the source operands.			/// A 64-bit integer containing one of the source operands.
	/// \returns A 64-bit integer vector containing the product of both operands.			/// \returns A 64-bit integer vector containing the product of both operands.
	static __inline__ __m64 __DEFAULT_FN_ATTRS_MMX			static __inline__ __m64 __DEFAULT_FN_ATTRS
	_mm_mul_su32(__m64 __a, __m64 __b)			_mm_mul_su32(__m64 __a, __m64 __b)
	{			{
	return __builtin_ia32_pmuludq((__v2si)__a, (__v2si)__b);			return __trunc64(__builtin_ia32_pmuludq128((__v4si)__anyext128(__a),
				(__v4si)__anyext128(__b)));
	}			}

	/// Multiplies 32-bit unsigned integer values contained in the lower			/// Multiplies 32-bit unsigned integer values contained in the lower
	/// bits of the corresponding elements of two [2 x i64] vectors, and returns			/// bits of the corresponding elements of two [2 x i64] vectors, and returns
	/// the 64-bit products in the corresponding elements of a [2 x i64] vector.			/// the 64-bit products in the corresponding elements of a [2 x i64] vector.
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	[...]
	/// This intrinsic corresponds to the <c> PSUBQ </c> instruction.			/// This intrinsic corresponds to the <c> PSUBQ </c> instruction.
	///			///
	/// \param __a			/// \param __a
	/// A 64-bit integer vector containing the minuend.			/// A 64-bit integer vector containing the minuend.
	/// \param __b			/// \param __b
	/// A 64-bit integer vector containing the subtrahend.			/// A 64-bit integer vector containing the subtrahend.
	/// \returns A 64-bit integer vector containing the difference of the values in			/// \returns A 64-bit integer vector containing the difference of the values in
	/// the operands.			/// the operands.
	static __inline__ __m64 __DEFAULT_FN_ATTRS_MMX			static __inline__ __m64 __DEFAULT_FN_ATTRS
	_mm_sub_si64(__m64 __a, __m64 __b)			_mm_sub_si64(__m64 __a, __m64 __b)
	{			{
	return (__m64)__builtin_ia32_psubq((__v1di)__a, (__v1di)__b);			return (__m64)((unsigned long long)__a - (unsigned long long)__b);
	}			}

	/// Subtracts the corresponding elements of two [2 x i64] vectors.			/// Subtracts the corresponding elements of two [2 x i64] vectors.
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> VPSUBQ / PSUBQ </c> instruction.			/// This intrinsic corresponds to the <c> VPSUBQ / PSUBQ </c> instruction.
	///			///
	[...]
	///			///
	/// This intrinsic corresponds to the <c> PAUSE </c> instruction.			/// This intrinsic corresponds to the <c> PAUSE </c> instruction.
	///			///
	void _mm_pause(void);			void _mm_pause(void);

	#if defined(__cplusplus)			#if defined(__cplusplus)
	} // extern "C"			} // extern "C"
	#endif			#endif

				#undef __anyext128
				#undef __trunc64
	#undef __DEFAULT_FN_ATTRS			#undef __DEFAULT_FN_ATTRS
	#undef __DEFAULT_FN_ATTRS_MMX

	#define _MM_SHUFFLE2(x, y) (((x) << 1) | (y))			#define _MM_SHUFFLE2(x, y) (((x) << 1) | (y))

	#define _MM_DENORMALS_ZERO_ON (0x0040U)			#define _MM_DENORMALS_ZERO_ON (0x0040U)
	#define _MM_DENORMALS_ZERO_OFF (0x0000U)			#define _MM_DENORMALS_ZERO_OFF (0x0000U)

	#define _MM_DENORMALS_ZERO_MASK (0x0040U)			#define _MM_DENORMALS_ZERO_MASK (0x0040U)

	#define _MM_GET_DENORMALS_ZERO_MODE() (_mm_getcsr() & _MM_DENORMALS_ZERO_MASK)			#define _MM_GET_DENORMALS_ZERO_MODE() (_mm_getcsr() & _MM_DENORMALS_ZERO_MASK)
	#define _MM_SET_DENORMALS_ZERO_MODE(x) (_mm_setcsr((_mm_getcsr() & ~_MM_DENORMALS_ZERO_MASK) | (x)))			#define _MM_SET_DENORMALS_ZERO_MODE(x) (_mm_setcsr((_mm_getcsr() & ~_MM_DENORMALS_ZERO_MASK) | (x)))

	#endif /* __EMMINTRIN_H */			#endif /* __EMMINTRIN_H */

clang/lib/Headers/mmintrin.h

	[...]

	typedef long long __m64 __attribute__((__vector_size__(8), __aligned__(8)));			typedef long long __m64 __attribute__((__vector_size__(8), __aligned__(8)));

	typedef long long __v1di __attribute__((__vector_size__(8)));			typedef long long __v1di __attribute__((__vector_size__(8)));
	typedef int __v2si __attribute__((__vector_size__(8)));			typedef int __v2si __attribute__((__vector_size__(8)));
	typedef short __v4hi __attribute__((__vector_size__(8)));			typedef short __v4hi __attribute__((__vector_size__(8)));
	typedef char __v8qi __attribute__((__vector_size__(8)));			typedef char __v8qi __attribute__((__vector_size__(8)));

				/* Unsigned types */
				typedef unsigned long long __v1du __attribute__ ((__vector_size__ (8)));
				typedef unsigned int __v2su __attribute__ ((__vector_size__ (8)));
				typedef unsigned short __v4hu __attribute__((__vector_size__(8)));
				typedef unsigned char __v8qu __attribute__((__vector_size__(8)));

				/* We need an explicitly signed variant for char. Note that this shouldn't
				* appear in the interface though. */
				typedef signed char __v8qs __attribute__((__vector_size__(8)));

				/* SSE/SSE2 types */
				typedef long long __m128i __attribute__((__vector_size__(16), __aligned__(16)));
				typedef long long __v2di __attribute__ ((__vector_size__ (16)));
				typedef int __v4si __attribute__((__vector_size__(16)));
				typedef short __v8hi __attribute__((__vector_size__(16)));
				typedef char __v16qi __attribute__((__vector_size__(16)));

	/* Define the default attributes for the functions in this file. */			/* Define the default attributes for the functions in this file. */
	#define __DEFAULT_FN_ATTRS __attribute__((__always_inline__, __nodebug__, __target__("mmx"), __min_vector_width__(64)))			#define __DEFAULT_FN_ATTRS_SSE2 __attribute__((__always_inline__, __nodebug__, __target__("sse2"), __min_vector_width__(64)))

				#define __trunc64(x) (__m64)__builtin_shufflevector((__v2di)(x), __extension__ (__v2di){}, 0)
				#define __anyext128(x) (__m128i)__builtin_shufflevector((__v2si)(x), __extension__ (__v2si){}, 0, 1, -1, -1)
				#define __extract2_32(a) (__m64)__builtin_shufflevector((__v4si)(a), __extension__ (__v4si){}, 0, 2);

	/// Clears the MMX state by setting the state of the x87 stack registers			/// Clears the MMX state by setting the state of the x87 stack registers
	/// to empty.			/// to empty.
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> EMMS </c> instruction.			/// This intrinsic corresponds to the <c> EMMS </c> instruction.
	///			///
	[...]
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> MOVD </c> instruction.			/// This intrinsic corresponds to the <c> MOVD </c> instruction.
	///			///
	/// \param __i			/// \param __i
	/// A 32-bit integer value.			/// A 32-bit integer value.
	/// \returns A 64-bit integer vector. The lower 32 bits contain the value of the			/// \returns A 64-bit integer vector. The lower 32 bits contain the value of the
	/// parameter. The upper 32 bits are set to 0.			/// parameter. The upper 32 bits are set to 0.
	static __inline__ __m64 __DEFAULT_FN_ATTRS			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_cvtsi32_si64(int __i)			_mm_cvtsi32_si64(int __i)
	{			{
	return (__m64)__builtin_ia32_vec_init_v2si(__i, 0);			return __extension__ (__m64)(__v2si){__i, 0};
	}			}

	/// Returns the lower 32 bits of a 64-bit integer vector as a 32-bit			/// Returns the lower 32 bits of a 64-bit integer vector as a 32-bit
	/// signed integer.			/// signed integer.
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> MOVD </c> instruction.			/// This intrinsic corresponds to the <c> MOVD </c> instruction.
	///			///
	/// \param __m			/// \param __m
	/// A 64-bit integer vector.			/// A 64-bit integer vector.
	/// \returns A 32-bit signed integer value containing the lower 32 bits of the			/// \returns A 32-bit signed integer value containing the lower 32 bits of the
	/// parameter.			/// parameter.
	static __inline__ int __DEFAULT_FN_ATTRS			static __inline__ int __DEFAULT_FN_ATTRS_SSE2
	_mm_cvtsi64_si32(__m64 __m)			_mm_cvtsi64_si32(__m64 __m)
	{			{
	return __builtin_ia32_vec_ext_v2si((__v2si)__m, 0);			return ((__v2si)__m)[0];
	}			}

	/// Casts a 64-bit signed integer value into a 64-bit integer vector.			/// Casts a 64-bit signed integer value into a 64-bit integer vector.
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> MOVQ </c> instruction.			/// This intrinsic corresponds to the <c> MOVQ </c> instruction.
	///			///
	/// \param __i			/// \param __i
	/// A 64-bit signed integer.			/// A 64-bit signed integer.
	/// \returns A 64-bit integer vector containing the same bitwise pattern as the			/// \returns A 64-bit integer vector containing the same bitwise pattern as the
	/// parameter.			/// parameter.
	static __inline__ __m64 __DEFAULT_FN_ATTRS			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_cvtsi64_m64(long long __i)			_mm_cvtsi64_m64(long long __i)
	{			{
	return (__m64)__i;			return (__m64)__i;
	}			}

	/// Casts a 64-bit integer vector into a 64-bit signed integer value.			/// Casts a 64-bit integer vector into a 64-bit signed integer value.
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> MOVQ </c> instruction.			/// This intrinsic corresponds to the <c> MOVQ </c> instruction.
	///			///
	/// \param __m			/// \param __m
	/// A 64-bit integer vector.			/// A 64-bit integer vector.
	/// \returns A 64-bit signed integer containing the same bitwise pattern as the			/// \returns A 64-bit signed integer containing the same bitwise pattern as the
	/// parameter.			/// parameter.
	static __inline__ long long __DEFAULT_FN_ATTRS			static __inline__ long long __DEFAULT_FN_ATTRS_SSE2
	_mm_cvtm64_si64(__m64 __m)			_mm_cvtm64_si64(__m64 __m)
	{			{
	return (long long)__m;			return (long long)__m;
	}			}

	/// Converts 16-bit signed integers from both 64-bit integer vector			/// Converts 16-bit signed integers from both 64-bit integer vector
	/// parameters of [4 x i16] into 8-bit signed integer values, and constructs			/// parameters of [4 x i16] into 8-bit signed integer values, and constructs
	/// a 64-bit integer vector of [8 x i8] as the result. Positive values			/// a 64-bit integer vector of [8 x i8] as the result. Positive values
	(13 lines not shown)
	/// \param __m2			/// \param __m2
	/// A 64-bit integer vector of [4 x i16]. Each 16-bit element is treated as a			/// A 64-bit integer vector of [4 x i16]. Each 16-bit element is treated as a
	/// 16-bit signed integer and is converted to an 8-bit signed integer with			/// 16-bit signed integer and is converted to an 8-bit signed integer with
	/// saturation. Positive values greater than 0x7F are saturated to 0x7F.			/// saturation. Positive values greater than 0x7F are saturated to 0x7F.
	/// Negative values less than 0x80 are saturated to 0x80. The converted			/// Negative values less than 0x80 are saturated to 0x80. The converted
	/// [4 x i8] values are written to the upper 32 bits of the result.			/// [4 x i8] values are written to the upper 32 bits of the result.
	/// \returns A 64-bit integer vector of [8 x i8] containing the converted			/// \returns A 64-bit integer vector of [8 x i8] containing the converted
	/// values.			/// values.
	static __inline__ __m64 __DEFAULT_FN_ATTRS			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_packs_pi16(__m64 __m1, __m64 __m2)			_mm_packs_pi16(__m64 __m1, __m64 __m2)
	{			{
	return (__m64)__builtin_ia32_packsswb((__v4hi)__m1, (__v4hi)__m2);			return __extract2_32(__builtin_ia32_packsswb128((__v8hi)__anyext128(__m1),
				(__v8hi)__anyext128(__m2)));
	}			}
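The new body widens both operands to 128 bits, packs, and then narrows again. The scalar model below sketches why that can work: the 128-bit pack places the first operand's results in bytes 0..7 and the second operand's in bytes 8..15, so the meaningful bytes end up in 32-bit elements 0 and 2. The definitions of `__anyext128` and `__extract2_32` are not shown in this part of the diff, so treating them as "widen with don't-care upper bits" and "keep 32-bit elements 0 and 2" is an assumption here, and all function names in the sketch are hypothetical.

```c
#include <stdint.h>

/* Clamp a signed 16-bit value into the signed 8-bit range (PACKSSWB rule). */
static int8_t sat_i16_to_i8(int16_t x)
{
    if (x > INT8_MAX) return INT8_MAX;
    if (x < INT8_MIN) return INT8_MIN;
    return (int8_t)x;
}

/* Scalar model of the 128-bit pack: a -> bytes 0..7, b -> bytes 8..15. */
static void packs_128_model(const int16_t a[8], const int16_t b[8],
                            int8_t out[16])
{
    for (int i = 0; i < 8; ++i) {
        out[i] = sat_i16_to_i8(a[i]);
        out[8 + i] = sat_i16_to_i8(b[i]);
    }
}

/* Model of the 64-bit intrinsic: widen (upper lanes are don't-care, zero
   here), pack, then keep bytes 0..3 and 8..11 of the wide result. */
static void packs_pi16_model(const int16_t m1[4], const int16_t m2[4],
                             int8_t out[8])
{
    int16_t a[8] = {0}, b[8] = {0};
    int8_t wide[16];
    for (int i = 0; i < 4; ++i) {
        a[i] = m1[i];
        b[i] = m2[i];
    }
    packs_128_model(a, b, wide);
    for (int i = 0; i < 4; ++i) {
        out[i] = wide[i];         /* 32-bit element 0 */
        out[4 + i] = wide[8 + i]; /* 32-bit element 2 */
    }
}
```

In this model the saturation bounds written as 0x7F and 0x80 in the comments are the 8-bit patterns for 127 and -128.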

	/// Converts 32-bit signed integers from both 64-bit integer vector			/// Converts 32-bit signed integers from both 64-bit integer vector
	/// parameters of [2 x i32] into 16-bit signed integer values, and constructs			/// parameters of [2 x i32] into 16-bit signed integer values, and constructs
	/// a 64-bit integer vector of [4 x i16] as the result. Positive values			/// a 64-bit integer vector of [4 x i16] as the result. Positive values
	/// greater than 0x7FFF are saturated to 0x7FFF. Negative values less than			/// greater than 0x7FFF are saturated to 0x7FFF. Negative values less than
	/// 0x8000 are saturated to 0x8000.			/// 0x8000 are saturated to 0x8000.
	///			///
	(10 lines not shown)
	/// \param __m2			/// \param __m2
	/// A 64-bit integer vector of [2 x i32]. Each 32-bit element is treated as a			/// A 64-bit integer vector of [2 x i32]. Each 32-bit element is treated as a
	/// 32-bit signed integer and is converted to a 16-bit signed integer with			/// 32-bit signed integer and is converted to a 16-bit signed integer with
	/// saturation. Positive values greater than 0x7FFF are saturated to 0x7FFF.			/// saturation. Positive values greater than 0x7FFF are saturated to 0x7FFF.
	/// Negative values less than 0x8000 are saturated to 0x8000. The converted			/// Negative values less than 0x8000 are saturated to 0x8000. The converted
	/// [2 x i16] values are written to the upper 32 bits of the result.			/// [2 x i16] values are written to the upper 32 bits of the result.
	/// \returns A 64-bit integer vector of [4 x i16] containing the converted			/// \returns A 64-bit integer vector of [4 x i16] containing the converted
	/// values.			/// values.
	static __inline__ __m64 __DEFAULT_FN_ATTRS			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_packs_pi32(__m64 __m1, __m64 __m2)			_mm_packs_pi32(__m64 __m1, __m64 __m2)
	{			{
	return (__m64)__builtin_ia32_packssdw((__v2si)__m1, (__v2si)__m2);			return __extract2_32(__builtin_ia32_packssdw128((__v4si)__anyext128(__m1),
				(__v4si)__anyext128(__m2)));
	}			}

	/// Converts 16-bit signed integers from both 64-bit integer vector			/// Converts 16-bit signed integers from both 64-bit integer vector
	/// parameters of [4 x i16] into 8-bit unsigned integer values, and			/// parameters of [4 x i16] into 8-bit unsigned integer values, and
	/// constructs a 64-bit integer vector of [8 x i8] as the result. Values			/// constructs a 64-bit integer vector of [8 x i8] as the result. Values
	/// greater than 0xFF are saturated to 0xFF. Values less than 0 are saturated			/// greater than 0xFF are saturated to 0xFF. Values less than 0 are saturated
	/// to 0.			/// to 0.
	///			///
	(10 lines not shown)
	/// \param __m2			/// \param __m2
	/// A 64-bit integer vector of [4 x i16]. Each 16-bit element is treated as a			/// A 64-bit integer vector of [4 x i16]. Each 16-bit element is treated as a
	/// 16-bit signed integer and is converted to an 8-bit unsigned integer with			/// 16-bit signed integer and is converted to an 8-bit unsigned integer with
	/// saturation. Values greater than 0xFF are saturated to 0xFF. Values less			/// saturation. Values greater than 0xFF are saturated to 0xFF. Values less
	/// than 0 are saturated to 0. The converted [4 x i8] values are written to			/// than 0 are saturated to 0. The converted [4 x i8] values are written to
	/// the upper 32 bits of the result.			/// the upper 32 bits of the result.
	/// \returns A 64-bit integer vector of [8 x i8] containing the converted			/// \returns A 64-bit integer vector of [8 x i8] containing the converted
	/// values.			/// values.
	static __inline__ __m64 __DEFAULT_FN_ATTRS			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_packs_pu16(__m64 __m1, __m64 __m2)			_mm_packs_pu16(__m64 __m1, __m64 __m2)
	{			{
	return (__m64)__builtin_ia32_packuswb((__v4hi)__m1, (__v4hi)__m2);			return __extract2_32(__builtin_ia32_packuswb128((__v8hi)__anyext128(__m1),
				(__v8hi)__anyext128(__m2)));
	}			}
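For the unsigned pack the input elements are still read as signed 16-bit values; only the output range changes. A per-element sketch of the rule stated in the comment above (the helper name is hypothetical):

```c
#include <stdint.h>

/* Clamp a signed 16-bit value into the unsigned 8-bit range (PACKUSWB rule):
   values above 0xFF become 0xFF, negative values become 0. */
static uint8_t sat_i16_to_u8(int16_t x)
{
    if (x > UINT8_MAX) return UINT8_MAX;
    if (x < 0) return 0;
    return (uint8_t)x;
}
```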

	/// Unpacks the upper 32 bits from two 64-bit integer vectors of [8 x i8]			/// Unpacks the upper 32 bits from two 64-bit integer vectors of [8 x i8]
	/// and interleaves them into a 64-bit integer vector of [8 x i8].			/// and interleaves them into a 64-bit integer vector of [8 x i8].
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> PUNPCKHBW </c> instruction.			/// This intrinsic corresponds to the <c> PUNPCKHBW </c> instruction.
	///			///
	/// \param __m1			/// \param __m1
	/// A 64-bit integer vector of [8 x i8]. \n			/// A 64-bit integer vector of [8 x i8]. \n
	/// Bits [39:32] are written to bits [7:0] of the result. \n			/// Bits [39:32] are written to bits [7:0] of the result. \n
	/// Bits [47:40] are written to bits [23:16] of the result. \n			/// Bits [47:40] are written to bits [23:16] of the result. \n
	/// Bits [55:48] are written to bits [39:32] of the result. \n			/// Bits [55:48] are written to bits [39:32] of the result. \n
	/// Bits [63:56] are written to bits [55:48] of the result.			/// Bits [63:56] are written to bits [55:48] of the result.
	/// \param __m2			/// \param __m2
	/// A 64-bit integer vector of [8 x i8].			/// A 64-bit integer vector of [8 x i8].
	/// Bits [39:32] are written to bits [15:8] of the result. \n			/// Bits [39:32] are written to bits [15:8] of the result. \n
	/// Bits [47:40] are written to bits [31:24] of the result. \n			/// Bits [47:40] are written to bits [31:24] of the result. \n
	/// Bits [55:48] are written to bits [47:40] of the result. \n			/// Bits [55:48] are written to bits [47:40] of the result. \n
	/// Bits [63:56] are written to bits [63:56] of the result.			/// Bits [63:56] are written to bits [63:56] of the result.
	/// \returns A 64-bit integer vector of [8 x i8] containing the interleaved			/// \returns A 64-bit integer vector of [8 x i8] containing the interleaved
	/// values.			/// values.
	static __inline__ __m64 __DEFAULT_FN_ATTRS			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_unpackhi_pi8(__m64 __m1, __m64 __m2)			_mm_unpackhi_pi8(__m64 __m1, __m64 __m2)
	{			{
	return (__m64)__builtin_ia32_punpckhbw((__v8qi)__m1, (__v8qi)__m2);			return (__m64)__builtin_shufflevector((__v8qi)__m1, (__v8qi)__m2,
				4, 12, 5, 13, 6, 14, 7, 15);
	}			}
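The unpack intrinsics are now expressed as generic shuffles. In Clang's `__builtin_shufflevector`, indices address the concatenation of the two source vectors, so for 8-element sources indices 0..7 select from the first operand and 8..15 from the second. The scalar model below (hypothetical names, plain arrays instead of vector types) checks that the index list `4, 12, 5, 13, 6, 14, 7, 15` interleaves the upper four bytes of each operand, as the bit ranges in the comment describe.

```c
#include <stdint.h>

/* Two-source shuffle over 8-byte vectors: index < 8 selects a[index],
   index >= 8 selects b[index - 8]. */
static void shuffle2x8(const uint8_t a[8], const uint8_t b[8],
                       const int idx[8], uint8_t out[8])
{
    for (int i = 0; i < 8; ++i)
        out[i] = idx[i] < 8 ? a[idx[i]] : b[idx[i] - 8];
}

/* Model of _mm_unpackhi_pi8 using the index list from the diff. */
static void unpackhi_pi8_model(const uint8_t a[8], const uint8_t b[8],
                               uint8_t out[8])
{
    static const int idx[8] = {4, 12, 5, 13, 6, 14, 7, 15};
    shuffle2x8(a, b, idx, out);
}
```

The other unpack variants in this diff follow the same pattern with fewer elements, for example `2, 6, 3, 7` for four 16-bit elements.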

	/// Unpacks the upper 32 bits from two 64-bit integer vectors of			/// Unpacks the upper 32 bits from two 64-bit integer vectors of
	/// [4 x i16] and interleaves them into a 64-bit integer vector of [4 x i16].			/// [4 x i16] and interleaves them into a 64-bit integer vector of [4 x i16].
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> PUNPCKHWD </c> instruction.			/// This intrinsic corresponds to the <c> PUNPCKHWD </c> instruction.
	///			///
	/// \param __m1			/// \param __m1
	/// A 64-bit integer vector of [4 x i16].			/// A 64-bit integer vector of [4 x i16].
	/// Bits [47:32] are written to bits [15:0] of the result. \n			/// Bits [47:32] are written to bits [15:0] of the result. \n
	/// Bits [63:48] are written to bits [47:32] of the result.			/// Bits [63:48] are written to bits [47:32] of the result.
	/// \param __m2			/// \param __m2
	/// A 64-bit integer vector of [4 x i16].			/// A 64-bit integer vector of [4 x i16].
	/// Bits [47:32] are written to bits [31:16] of the result. \n			/// Bits [47:32] are written to bits [31:16] of the result. \n
	/// Bits [63:48] are written to bits [63:48] of the result.			/// Bits [63:48] are written to bits [63:48] of the result.
	/// \returns A 64-bit integer vector of [4 x i16] containing the interleaved			/// \returns A 64-bit integer vector of [4 x i16] containing the interleaved
	/// values.			/// values.
	static __inline__ __m64 __DEFAULT_FN_ATTRS			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_unpackhi_pi16(__m64 __m1, __m64 __m2)			_mm_unpackhi_pi16(__m64 __m1, __m64 __m2)
	{			{
	return (__m64)__builtin_ia32_punpckhwd((__v4hi)__m1, (__v4hi)__m2);			return (__m64)__builtin_shufflevector((__v4hi)__m1, (__v4hi)__m2,
				2, 6, 3, 7);
	}			}

	/// Unpacks the upper 32 bits from two 64-bit integer vectors of			/// Unpacks the upper 32 bits from two 64-bit integer vectors of
	/// [2 x i32] and interleaves them into a 64-bit integer vector of [2 x i32].			/// [2 x i32] and interleaves them into a 64-bit integer vector of [2 x i32].
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> PUNPCKHDQ </c> instruction.			/// This intrinsic corresponds to the <c> PUNPCKHDQ </c> instruction.
	///			///
	/// \param __m1			/// \param __m1
	/// A 64-bit integer vector of [2 x i32]. The upper 32 bits are written to			/// A 64-bit integer vector of [2 x i32]. The upper 32 bits are written to
	/// the lower 32 bits of the result.			/// the lower 32 bits of the result.
	/// \param __m2			/// \param __m2
	/// A 64-bit integer vector of [2 x i32]. The upper 32 bits are written to			/// A 64-bit integer vector of [2 x i32]. The upper 32 bits are written to
	/// the upper 32 bits of the result.			/// the upper 32 bits of the result.
	/// \returns A 64-bit integer vector of [2 x i32] containing the interleaved			/// \returns A 64-bit integer vector of [2 x i32] containing the interleaved
	/// values.			/// values.
	static __inline__ __m64 __DEFAULT_FN_ATTRS			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_unpackhi_pi32(__m64 __m1, __m64 __m2)			_mm_unpackhi_pi32(__m64 __m1, __m64 __m2)
	{			{
	return (__m64)__builtin_ia32_punpckhdq((__v2si)__m1, (__v2si)__m2);			return (__m64)__builtin_shufflevector((__v2si)__m1, (__v2si)__m2, 1, 3);
	}			}

	/// Unpacks the lower 32 bits from two 64-bit integer vectors of [8 x i8]			/// Unpacks the lower 32 bits from two 64-bit integer vectors of [8 x i8]
	/// and interleaves them into a 64-bit integer vector of [8 x i8].			/// and interleaves them into a 64-bit integer vector of [8 x i8].
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> PUNPCKLBW </c> instruction.			/// This intrinsic corresponds to the <c> PUNPCKLBW </c> instruction.
	///			///
	/// \param __m1			/// \param __m1
	/// A 64-bit integer vector of [8 x i8].			/// A 64-bit integer vector of [8 x i8].
	/// Bits [7:0] are written to bits [7:0] of the result. \n			/// Bits [7:0] are written to bits [7:0] of the result. \n
	/// Bits [15:8] are written to bits [23:16] of the result. \n			/// Bits [15:8] are written to bits [23:16] of the result. \n
	/// Bits [23:16] are written to bits [39:32] of the result. \n			/// Bits [23:16] are written to bits [39:32] of the result. \n
	/// Bits [31:24] are written to bits [55:48] of the result.			/// Bits [31:24] are written to bits [55:48] of the result.
	/// \param __m2			/// \param __m2
	/// A 64-bit integer vector of [8 x i8].			/// A 64-bit integer vector of [8 x i8].
	/// Bits [7:0] are written to bits [15:8] of the result. \n			/// Bits [7:0] are written to bits [15:8] of the result. \n
	/// Bits [15:8] are written to bits [31:24] of the result. \n			/// Bits [15:8] are written to bits [31:24] of the result. \n
	/// Bits [23:16] are written to bits [47:40] of the result. \n			/// Bits [23:16] are written to bits [47:40] of the result. \n
	/// Bits [31:24] are written to bits [63:56] of the result.			/// Bits [31:24] are written to bits [63:56] of the result.
	/// \returns A 64-bit integer vector of [8 x i8] containing the interleaved			/// \returns A 64-bit integer vector of [8 x i8] containing the interleaved
	/// values.			/// values.
	static __inline__ __m64 __DEFAULT_FN_ATTRS			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_unpacklo_pi8(__m64 __m1, __m64 __m2)			_mm_unpacklo_pi8(__m64 __m1, __m64 __m2)
	{			{
	return (__m64)__builtin_ia32_punpcklbw((__v8qi)__m1, (__v8qi)__m2);			return (__m64)__builtin_shufflevector((__v8qi)__m1, (__v8qi)__m2,
				0, 8, 1, 9, 2, 10, 3, 11);
	}			}

	/// Unpacks the lower 32 bits from two 64-bit integer vectors of			/// Unpacks the lower 32 bits from two 64-bit integer vectors of
	/// [4 x i16] and interleaves them into a 64-bit integer vector of [4 x i16].			/// [4 x i16] and interleaves them into a 64-bit integer vector of [4 x i16].
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> PUNPCKLWD </c> instruction.			/// This intrinsic corresponds to the <c> PUNPCKLWD </c> instruction.
	///			///
	/// \param __m1			/// \param __m1
	/// A 64-bit integer vector of [4 x i16].			/// A 64-bit integer vector of [4 x i16].
	/// Bits [15:0] are written to bits [15:0] of the result. \n			/// Bits [15:0] are written to bits [15:0] of the result. \n
	/// Bits [31:16] are written to bits [47:32] of the result.			/// Bits [31:16] are written to bits [47:32] of the result.
	/// \param __m2			/// \param __m2
	/// A 64-bit integer vector of [4 x i16].			/// A 64-bit integer vector of [4 x i16].
	/// Bits [15:0] are written to bits [31:16] of the result. \n			/// Bits [15:0] are written to bits [31:16] of the result. \n
	/// Bits [31:16] are written to bits [63:48] of the result.			/// Bits [31:16] are written to bits [63:48] of the result.
	/// \returns A 64-bit integer vector of [4 x i16] containing the interleaved			/// \returns A 64-bit integer vector of [4 x i16] containing the interleaved
	/// values.			/// values.
	static __inline__ __m64 __DEFAULT_FN_ATTRS			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_unpacklo_pi16(__m64 __m1, __m64 __m2)			_mm_unpacklo_pi16(__m64 __m1, __m64 __m2)
	{			{
	return (__m64)__builtin_ia32_punpcklwd((__v4hi)__m1, (__v4hi)__m2);			return (__m64)__builtin_shufflevector((__v4hi)__m1, (__v4hi)__m2,
				0, 4, 1, 5);
	}			}

	/// Unpacks the lower 32 bits from two 64-bit integer vectors of			/// Unpacks the lower 32 bits from two 64-bit integer vectors of
	/// [2 x i32] and interleaves them into a 64-bit integer vector of [2 x i32].			/// [2 x i32] and interleaves them into a 64-bit integer vector of [2 x i32].
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> PUNPCKLDQ </c> instruction.			/// This intrinsic corresponds to the <c> PUNPCKLDQ </c> instruction.
	///			///
	/// \param __m1			/// \param __m1
	/// A 64-bit integer vector of [2 x i32]. The lower 32 bits are written to			/// A 64-bit integer vector of [2 x i32]. The lower 32 bits are written to
	/// the lower 32 bits of the result.			/// the lower 32 bits of the result.
	/// \param __m2			/// \param __m2
	/// A 64-bit integer vector of [2 x i32]. The lower 32 bits are written to			/// A 64-bit integer vector of [2 x i32]. The lower 32 bits are written to
	/// the upper 32 bits of the result.			/// the upper 32 bits of the result.
	/// \returns A 64-bit integer vector of [2 x i32] containing the interleaved			/// \returns A 64-bit integer vector of [2 x i32] containing the interleaved
	/// values.			/// values.
	static __inline__ __m64 __DEFAULT_FN_ATTRS			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_unpacklo_pi32(__m64 __m1, __m64 __m2)			_mm_unpacklo_pi32(__m64 __m1, __m64 __m2)
	{			{
	return (__m64)__builtin_ia32_punpckldq((__v2si)__m1, (__v2si)__m2);			return (__m64)__builtin_shufflevector((__v2si)__m1, (__v2si)__m2, 0, 2);
	}			}

	/// Adds each 8-bit integer element of the first 64-bit integer vector			/// Adds each 8-bit integer element of the first 64-bit integer vector
	/// of [8 x i8] to the corresponding 8-bit integer element of the second			/// of [8 x i8] to the corresponding 8-bit integer element of the second
	/// 64-bit integer vector of [8 x i8]. The lower 8 bits of the results are			/// 64-bit integer vector of [8 x i8]. The lower 8 bits of the results are
	/// packed into a 64-bit integer vector of [8 x i8].			/// packed into a 64-bit integer vector of [8 x i8].
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> PADDB </c> instruction.			/// This intrinsic corresponds to the <c> PADDB </c> instruction.
	///			///
	/// \param __m1			/// \param __m1
	/// A 64-bit integer vector of [8 x i8].			/// A 64-bit integer vector of [8 x i8].
	/// \param __m2			/// \param __m2
	/// A 64-bit integer vector of [8 x i8].			/// A 64-bit integer vector of [8 x i8].
	/// \returns A 64-bit integer vector of [8 x i8] containing the sums of both			/// \returns A 64-bit integer vector of [8 x i8] containing the sums of both
	/// parameters.			/// parameters.
	static __inline__ __m64 __DEFAULT_FN_ATTRS			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_add_pi8(__m64 __m1, __m64 __m2)			_mm_add_pi8(__m64 __m1, __m64 __m2)
	{			{
	return (__m64)__builtin_ia32_paddb((__v8qi)__m1, (__v8qi)__m2);			return (__m64)(((__v8qu)__m1) + ((__v8qu)__m2));
				craig.topper (inline comment): I think we should use __v8qu to match what we do in emmintrin.h. We don't currently set nsw on signed vector arithmetic, but we should be careful in case that changes in the future.
				jyknight (author): Done, here and everywhere else I was using signed math (except the comparisons).
	}			}
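The review comment above is about which element type the plain `+` operates on. PADDB keeps only the low 8 bits of each sum, which is the wraparound behavior C defines for unsigned types, whereas signed overflow is undefined and could in principle be marked `nsw`. A per-element sketch of the intended semantics, using a hypothetical helper on plain arrays rather than the header's vector types:

```c
#include <stdint.h>

/* Scalar model of PADDB: each lane is the sum modulo 2^8. */
static void add_pi8_model(const uint8_t a[8], const uint8_t b[8],
                          uint8_t out[8])
{
    for (int i = 0; i < 8; ++i)
        out[i] = (uint8_t)(a[i] + b[i]); /* keep the low 8 bits */
}
```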

	/// Adds each 16-bit integer element of the first 64-bit integer vector			/// Adds each 16-bit integer element of the first 64-bit integer vector
	/// of [4 x i16] to the corresponding 16-bit integer element of the second			/// of [4 x i16] to the corresponding 16-bit integer element of the second
	/// 64-bit integer vector of [4 x i16]. The lower 16 bits of the results are			/// 64-bit integer vector of [4 x i16]. The lower 16 bits of the results are
	/// packed into a 64-bit integer vector of [4 x i16].			/// packed into a 64-bit integer vector of [4 x i16].
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> PADDW </c> instruction.			/// This intrinsic corresponds to the <c> PADDW </c> instruction.
	///			///
	/// \param __m1			/// \param __m1
	/// A 64-bit integer vector of [4 x i16].			/// A 64-bit integer vector of [4 x i16].
	/// \param __m2			/// \param __m2
	/// A 64-bit integer vector of [4 x i16].			/// A 64-bit integer vector of [4 x i16].
	/// \returns A 64-bit integer vector of [4 x i16] containing the sums of both			/// \returns A 64-bit integer vector of [4 x i16] containing the sums of both
	/// parameters.			/// parameters.
	static __inline__ __m64 __DEFAULT_FN_ATTRS			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_add_pi16(__m64 __m1, __m64 __m2)			_mm_add_pi16(__m64 __m1, __m64 __m2)
	{			{
	return (__m64)__builtin_ia32_paddw((__v4hi)__m1, (__v4hi)__m2);			return (__m64)(((__v4hu)__m1) + ((__v4hu)__m2));
	}			}

	/// Adds each 32-bit integer element of the first 64-bit integer vector			/// Adds each 32-bit integer element of the first 64-bit integer vector
	/// of [2 x i32] to the corresponding 32-bit integer element of the second			/// of [2 x i32] to the corresponding 32-bit integer element of the second
	/// 64-bit integer vector of [2 x i32]. The lower 32 bits of the results are			/// 64-bit integer vector of [2 x i32]. The lower 32 bits of the results are
	/// packed into a 64-bit integer vector of [2 x i32].			/// packed into a 64-bit integer vector of [2 x i32].
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> PADDD </c> instruction.			/// This intrinsic corresponds to the <c> PADDD </c> instruction.
	///			///
	/// \param __m1			/// \param __m1
	/// A 64-bit integer vector of [2 x i32].			/// A 64-bit integer vector of [2 x i32].
	/// \param __m2			/// \param __m2
	/// A 64-bit integer vector of [2 x i32].			/// A 64-bit integer vector of [2 x i32].
	/// \returns A 64-bit integer vector of [2 x i32] containing the sums of both			/// \returns A 64-bit integer vector of [2 x i32] containing the sums of both
	/// parameters.			/// parameters.
	static __inline__ __m64 __DEFAULT_FN_ATTRS			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_add_pi32(__m64 __m1, __m64 __m2)			_mm_add_pi32(__m64 __m1, __m64 __m2)
	{			{
	return (__m64)__builtin_ia32_paddd((__v2si)__m1, (__v2si)__m2);			return (__m64)(((__v2su)__m1) + ((__v2su)__m2));
	}			}

	/// Adds each 8-bit signed integer element of the first 64-bit integer			/// Adds each 8-bit signed integer element of the first 64-bit integer
	/// vector of [8 x i8] to the corresponding 8-bit signed integer element of			/// vector of [8 x i8] to the corresponding 8-bit signed integer element of
	/// the second 64-bit integer vector of [8 x i8]. Positive sums greater than			/// the second 64-bit integer vector of [8 x i8]. Positive sums greater than
	/// 0x7F are saturated to 0x7F. Negative sums less than 0x80 are saturated to			/// 0x7F are saturated to 0x7F. Negative sums less than 0x80 are saturated to
	/// 0x80. The results are packed into a 64-bit integer vector of [8 x i8].			/// 0x80. The results are packed into a 64-bit integer vector of [8 x i8].
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> PADDSB </c> instruction.			/// This intrinsic corresponds to the <c> PADDSB </c> instruction.
	///			///
	/// \param __m1			/// \param __m1
	/// A 64-bit integer vector of [8 x i8].			/// A 64-bit integer vector of [8 x i8].
	/// \param __m2			/// \param __m2
	/// A 64-bit integer vector of [8 x i8].			/// A 64-bit integer vector of [8 x i8].
	/// \returns A 64-bit integer vector of [8 x i8] containing the saturated sums			/// \returns A 64-bit integer vector of [8 x i8] containing the saturated sums
	/// of both parameters.			/// of both parameters.
	static __inline__ __m64 __DEFAULT_FN_ATTRS			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_adds_pi8(__m64 __m1, __m64 __m2)			_mm_adds_pi8(__m64 __m1, __m64 __m2)
	{			{
	return (__m64)__builtin_ia32_paddsb((__v8qi)__m1, (__v8qi)__m2);			return __trunc64(__builtin_ia32_paddsb128((__v16qi)__anyext128(__m1),
				(__v16qi)__anyext128(__m2)));
	}			}
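Saturating adds cannot be written with a plain `+`, which is presumably why the new body goes through the 128-bit builtin and then narrows the result. Each lane is computed independently, so keeping the low 64 bits of the wide result gives the same lanes as the 64-bit operation; that `__trunc64` does exactly this is an assumption, since its definition is not shown here. The per-lane rule from the comment can be sketched as follows (hypothetical helper name):

```c
#include <stdint.h>

/* Scalar model of one PADDSB lane: add in a wider type, then clamp to
   [-128, 127], i.e. the bit patterns 0x80 and 0x7F. */
static int8_t adds_i8(int8_t a, int8_t b)
{
    int sum = (int)a + (int)b;
    if (sum > INT8_MAX) return INT8_MAX;
    if (sum < INT8_MIN) return INT8_MIN;
    return (int8_t)sum;
}
```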

	/// Adds each 16-bit signed integer element of the first 64-bit integer			/// Adds each 16-bit signed integer element of the first 64-bit integer
	/// vector of [4 x i16] to the corresponding 16-bit signed integer element of			/// vector of [4 x i16] to the corresponding 16-bit signed integer element of
	/// the second 64-bit integer vector of [4 x i16]. Positive sums greater than			/// the second 64-bit integer vector of [4 x i16]. Positive sums greater than
	/// 0x7FFF are saturated to 0x7FFF. Negative sums less than 0x8000 are			/// 0x7FFF are saturated to 0x7FFF. Negative sums less than 0x8000 are
	/// saturated to 0x8000. The results are packed into a 64-bit integer vector			/// saturated to 0x8000. The results are packed into a 64-bit integer vector
	/// of [4 x i16].			/// of [4 x i16].
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> PADDSW </c> instruction.			/// This intrinsic corresponds to the <c> PADDSW </c> instruction.
	///			///
	/// \param __m1			/// \param __m1
	/// A 64-bit integer vector of [4 x i16].			/// A 64-bit integer vector of [4 x i16].
	/// \param __m2			/// \param __m2
	/// A 64-bit integer vector of [4 x i16].			/// A 64-bit integer vector of [4 x i16].
	/// \returns A 64-bit integer vector of [4 x i16] containing the saturated sums			/// \returns A 64-bit integer vector of [4 x i16] containing the saturated sums
	/// of both parameters.			/// of both parameters.
	static __inline__ __m64 __DEFAULT_FN_ATTRS			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_adds_pi16(__m64 __m1, __m64 __m2)			_mm_adds_pi16(__m64 __m1, __m64 __m2)
	{			{
	return (__m64)__builtin_ia32_paddsw((__v4hi)__m1, (__v4hi)__m2);			return __trunc64(__builtin_ia32_paddsw128((__v8hi)__anyext128(__m1),
				(__v8hi)__anyext128(__m2)));
	}			}

	/// Adds each 8-bit unsigned integer element of the first 64-bit integer			/// Adds each 8-bit unsigned integer element of the first 64-bit integer
	/// vector of [8 x i8] to the corresponding 8-bit unsigned integer element of			/// vector of [8 x i8] to the corresponding 8-bit unsigned integer element of
	/// the second 64-bit integer vector of [8 x i8]. Sums greater than 0xFF are			/// the second 64-bit integer vector of [8 x i8]. Sums greater than 0xFF are
	/// saturated to 0xFF. The results are packed into a 64-bit integer vector of			/// saturated to 0xFF. The results are packed into a 64-bit integer vector of
	/// [8 x i8].			/// [8 x i8].
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> PADDUSB </c> instruction.			/// This intrinsic corresponds to the <c> PADDUSB </c> instruction.
	///			///
	/// \param __m1			/// \param __m1
	/// A 64-bit integer vector of [8 x i8].			/// A 64-bit integer vector of [8 x i8].
	/// \param __m2			/// \param __m2
	/// A 64-bit integer vector of [8 x i8].			/// A 64-bit integer vector of [8 x i8].
	/// \returns A 64-bit integer vector of [8 x i8] containing the saturated			/// \returns A 64-bit integer vector of [8 x i8] containing the saturated
	/// unsigned sums of both parameters.			/// unsigned sums of both parameters.
	static __inline__ __m64 __DEFAULT_FN_ATTRS			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_adds_pu8(__m64 __m1, __m64 __m2)			_mm_adds_pu8(__m64 __m1, __m64 __m2)
	{			{
	return (__m64)__builtin_ia32_paddusb((__v8qi)__m1, (__v8qi)__m2);			return __trunc64(__builtin_ia32_paddusb128((__v16qi)__anyext128(__m1),
				(__v16qi)__anyext128(__m2)));
	}			}
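The unsigned variant differs only in the clamp: there is no lower bound to hit, and sums above 0xFF become 0xFF. A one-lane sketch with a hypothetical helper name:

```c
#include <stdint.h>

/* Scalar model of one PADDUSB lane: add in a wider type, clamp at 0xFF. */
static uint8_t adds_u8(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + (unsigned)b;
    return sum > UINT8_MAX ? UINT8_MAX : (uint8_t)sum;
}
```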

	/// Adds each 16-bit unsigned integer element of the first 64-bit integer			/// Adds each 16-bit unsigned integer element of the first 64-bit integer
	/// vector of [4 x i16] to the corresponding 16-bit unsigned integer element			/// vector of [4 x i16] to the corresponding 16-bit unsigned integer element
	/// of the second 64-bit integer vector of [4 x i16]. Sums greater than			/// of the second 64-bit integer vector of [4 x i16]. Sums greater than
	/// 0xFFFF are saturated to 0xFFFF. The results are packed into a 64-bit			/// 0xFFFF are saturated to 0xFFFF. The results are packed into a 64-bit
	/// integer vector of [4 x i16].			/// integer vector of [4 x i16].
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> PADDUSW </c> instruction.			/// This intrinsic corresponds to the <c> PADDUSW </c> instruction.
	///			///
	/// \param __m1			/// \param __m1
	/// A 64-bit integer vector of [4 x i16].			/// A 64-bit integer vector of [4 x i16].
	/// \param __m2			/// \param __m2
	/// A 64-bit integer vector of [4 x i16].			/// A 64-bit integer vector of [4 x i16].
	/// \returns A 64-bit integer vector of [4 x i16] containing the saturated			/// \returns A 64-bit integer vector of [4 x i16] containing the saturated
	/// unsigned sums of both parameters.			/// unsigned sums of both parameters.
	static __inline__ __m64 __DEFAULT_FN_ATTRS			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_adds_pu16(__m64 __m1, __m64 __m2)			_mm_adds_pu16(__m64 __m1, __m64 __m2)
	{			{
	return (__m64)__builtin_ia32_paddusw((__v4hi)__m1, (__v4hi)__m2);			return __trunc64(__builtin_ia32_paddusw128((__v8hi)__anyext128(__m1),
				(__v8hi)__anyext128(__m2)));
	}			}

	/// Subtracts each 8-bit integer element of the second 64-bit integer			/// Subtracts each 8-bit integer element of the second 64-bit integer
	/// vector of [8 x i8] from the corresponding 8-bit integer element of the			/// vector of [8 x i8] from the corresponding 8-bit integer element of the
	/// first 64-bit integer vector of [8 x i8]. The lower 8 bits of the results			/// first 64-bit integer vector of [8 x i8]. The lower 8 bits of the results
	/// are packed into a 64-bit integer vector of [8 x i8].			/// are packed into a 64-bit integer vector of [8 x i8].
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> PSUBB </c> instruction.			/// This intrinsic corresponds to the <c> PSUBB </c> instruction.
	///			///
	/// \param __m1			/// \param __m1
	/// A 64-bit integer vector of [8 x i8] containing the minuends.			/// A 64-bit integer vector of [8 x i8] containing the minuends.
	/// \param __m2			/// \param __m2
	/// A 64-bit integer vector of [8 x i8] containing the subtrahends.			/// A 64-bit integer vector of [8 x i8] containing the subtrahends.
	/// \returns A 64-bit integer vector of [8 x i8] containing the differences of			/// \returns A 64-bit integer vector of [8 x i8] containing the differences of
	/// both parameters.			/// both parameters.
	static __inline__ __m64 __DEFAULT_FN_ATTRS			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_sub_pi8(__m64 __m1, __m64 __m2)			_mm_sub_pi8(__m64 __m1, __m64 __m2)
	{			{
	return (__m64)__builtin_ia32_psubb((__v8qi)__m1, (__v8qi)__m2);			return (__m64)(((__v8qu)__m1) - ((__v8qu)__m2));
	}			}

	/// Subtracts each 16-bit integer element of the second 64-bit integer			/// Subtracts each 16-bit integer element of the second 64-bit integer
	/// vector of [4 x i16] from the corresponding 16-bit integer element of the			/// vector of [4 x i16] from the corresponding 16-bit integer element of the
	/// first 64-bit integer vector of [4 x i16]. The lower 16 bits of the			/// first 64-bit integer vector of [4 x i16]. The lower 16 bits of the
	/// results are packed into a 64-bit integer vector of [4 x i16].			/// results are packed into a 64-bit integer vector of [4 x i16].
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> PSUBW </c> instruction.			/// This intrinsic corresponds to the <c> PSUBW </c> instruction.
	///			///
	/// \param __m1			/// \param __m1
	/// A 64-bit integer vector of [4 x i16] containing the minuends.			/// A 64-bit integer vector of [4 x i16] containing the minuends.
	/// \param __m2			/// \param __m2
	/// A 64-bit integer vector of [4 x i16] containing the subtrahends.			/// A 64-bit integer vector of [4 x i16] containing the subtrahends.
	/// \returns A 64-bit integer vector of [4 x i16] containing the differences of			/// \returns A 64-bit integer vector of [4 x i16] containing the differences of
	/// both parameters.			/// both parameters.
	static __inline__ __m64 __DEFAULT_FN_ATTRS			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_sub_pi16(__m64 __m1, __m64 __m2)			_mm_sub_pi16(__m64 __m1, __m64 __m2)
	{			{
	return (__m64)__builtin_ia32_psubw((__v4hi)__m1, (__v4hi)__m2);			return (__m64)(((__v4hu)__m1) - ((__v4hu)__m2));
	}			}

	/// Subtracts each 32-bit integer element of the second 64-bit integer			/// Subtracts each 32-bit integer element of the second 64-bit integer
	/// vector of [2 x i32] from the corresponding 32-bit integer element of the			/// vector of [2 x i32] from the corresponding 32-bit integer element of the
	/// first 64-bit integer vector of [2 x i32]. The lower 32 bits of the			/// first 64-bit integer vector of [2 x i32]. The lower 32 bits of the
	/// results are packed into a 64-bit integer vector of [2 x i32].			/// results are packed into a 64-bit integer vector of [2 x i32].
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> PSUBD </c> instruction.			/// This intrinsic corresponds to the <c> PSUBD </c> instruction.
	///			///
	/// \param __m1			/// \param __m1
	/// A 64-bit integer vector of [2 x i32] containing the minuends.			/// A 64-bit integer vector of [2 x i32] containing the minuends.
	/// \param __m2			/// \param __m2
	/// A 64-bit integer vector of [2 x i32] containing the subtrahends.			/// A 64-bit integer vector of [2 x i32] containing the subtrahends.
	/// \returns A 64-bit integer vector of [2 x i32] containing the differences of			/// \returns A 64-bit integer vector of [2 x i32] containing the differences of
	/// both parameters.			/// both parameters.
	static __inline__ __m64 __DEFAULT_FN_ATTRS			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_sub_pi32(__m64 __m1, __m64 __m2)			_mm_sub_pi32(__m64 __m1, __m64 __m2)
	{			{
	return (__m64)__builtin_ia32_psubd((__v2si)__m1, (__v2si)__m2);			return (__m64)(((__v2su)__m1) - ((__v2su)__m2));
	}			}

	/// Subtracts each 8-bit signed integer element of the second 64-bit			/// Subtracts each 8-bit signed integer element of the second 64-bit
	/// integer vector of [8 x i8] from the corresponding 8-bit signed integer			/// integer vector of [8 x i8] from the corresponding 8-bit signed integer
	/// element of the first 64-bit integer vector of [8 x i8]. Positive results			/// element of the first 64-bit integer vector of [8 x i8]. Positive results
	/// greater than 0x7F are saturated to 0x7F. Negative results less than 0x80			/// greater than 0x7F are saturated to 0x7F. Negative results less than 0x80
	/// are saturated to 0x80. The results are packed into a 64-bit integer			/// are saturated to 0x80. The results are packed into a 64-bit integer
	/// vector of [8 x i8].			/// vector of [8 x i8].
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> PSUBSB </c> instruction.			/// This intrinsic corresponds to the <c> PSUBSB </c> instruction.
	///			///
	/// \param __m1			/// \param __m1
	/// A 64-bit integer vector of [8 x i8] containing the minuends.			/// A 64-bit integer vector of [8 x i8] containing the minuends.
	/// \param __m2			/// \param __m2
	/// A 64-bit integer vector of [8 x i8] containing the subtrahends.			/// A 64-bit integer vector of [8 x i8] containing the subtrahends.
	/// \returns A 64-bit integer vector of [8 x i8] containing the saturated			/// \returns A 64-bit integer vector of [8 x i8] containing the saturated
	/// differences of both parameters.			/// differences of both parameters.
	static __inline__ __m64 __DEFAULT_FN_ATTRS			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_subs_pi8(__m64 __m1, __m64 __m2)			_mm_subs_pi8(__m64 __m1, __m64 __m2)
	{			{
	return (__m64)__builtin_ia32_psubsb((__v8qi)__m1, (__v8qi)__m2);			return __trunc64(__builtin_ia32_psubsb128((__v16qi)__anyext128(__m1),
				(__v16qi)__anyext128(__m2)));
	}			}

	/// Subtracts each 16-bit signed integer element of the second 64-bit			/// Subtracts each 16-bit signed integer element of the second 64-bit
	/// integer vector of [4 x i16] from the corresponding 16-bit signed integer			/// integer vector of [4 x i16] from the corresponding 16-bit signed integer
	/// element of the first 64-bit integer vector of [4 x i16]. Positive results			/// element of the first 64-bit integer vector of [4 x i16]. Positive results
	/// greater than 0x7FFF are saturated to 0x7FFF. Negative results less than			/// greater than 0x7FFF are saturated to 0x7FFF. Negative results less than
	/// 0x8000 are saturated to 0x8000. The results are packed into a 64-bit			/// 0x8000 are saturated to 0x8000. The results are packed into a 64-bit
	/// integer vector of [4 x i16].			/// integer vector of [4 x i16].
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> PSUBSW </c> instruction.			/// This intrinsic corresponds to the <c> PSUBSW </c> instruction.
	///			///
	/// \param __m1			/// \param __m1
	/// A 64-bit integer vector of [4 x i16] containing the minuends.			/// A 64-bit integer vector of [4 x i16] containing the minuends.
	/// \param __m2			/// \param __m2
	/// A 64-bit integer vector of [4 x i16] containing the subtrahends.			/// A 64-bit integer vector of [4 x i16] containing the subtrahends.
	/// \returns A 64-bit integer vector of [4 x i16] containing the saturated			/// \returns A 64-bit integer vector of [4 x i16] containing the saturated
	/// differences of both parameters.			/// differences of both parameters.
	static __inline__ __m64 __DEFAULT_FN_ATTRS			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_subs_pi16(__m64 __m1, __m64 __m2)			_mm_subs_pi16(__m64 __m1, __m64 __m2)
	{			{
	return (__m64)__builtin_ia32_psubsw((__v4hi)__m1, (__v4hi)__m2);			return __trunc64(__builtin_ia32_psubsw128((__v8hi)__anyext128(__m1),
				(__v8hi)__anyext128(__m2)));
	}			}
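The signed saturating forms clamp each lane to the range of its element type, as the comments above describe. A minimal per-lane sketch of that rule, assuming ordinary two's-complement `int8_t`/`int16_t`, might look like this. The function names are hypothetical and do not appear in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-lane model of PSUBSB-style signed saturation. */
static int8_t subs_i8_model(int8_t a, int8_t b)
{
  int d = (int)a - (int)b;          /* exact difference, cannot overflow int */
  if (d > INT8_MAX) d = INT8_MAX;   /* clamp to 0x7F */
  if (d < INT8_MIN) d = INT8_MIN;   /* clamp to 0x80, i.e. -128 */
  return (int8_t)d;
}

/* Hypothetical per-lane model of PSUBSW-style signed saturation. */
static int16_t subs_i16_model(int16_t a, int16_t b)
{
  int32_t d = (int32_t)a - (int32_t)b;
  if (d > INT16_MAX) d = INT16_MAX; /* clamp to 0x7FFF */
  if (d < INT16_MIN) d = INT16_MIN; /* clamp to 0x8000, i.e. -32768 */
  return (int16_t)d;
}
```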

	/// Subtracts each 8-bit unsigned integer element of the second 64-bit			/// Subtracts each 8-bit unsigned integer element of the second 64-bit
	/// integer vector of [8 x i8] from the corresponding 8-bit unsigned integer			/// integer vector of [8 x i8] from the corresponding 8-bit unsigned integer
	/// element of the first 64-bit integer vector of [8 x i8].			/// element of the first 64-bit integer vector of [8 x i8].
	///			///
	/// If an element of the first vector is less than the corresponding element			/// If an element of the first vector is less than the corresponding element
	/// of the second vector, the result is saturated to 0. The results are			/// of the second vector, the result is saturated to 0. The results are
	/// packed into a 64-bit integer vector of [8 x i8].			/// packed into a 64-bit integer vector of [8 x i8].
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> PSUBUSB </c> instruction.			/// This intrinsic corresponds to the <c> PSUBUSB </c> instruction.
	///			///
	/// \param __m1			/// \param __m1
	/// A 64-bit integer vector of [8 x i8] containing the minuends.			/// A 64-bit integer vector of [8 x i8] containing the minuends.
	/// \param __m2			/// \param __m2
	/// A 64-bit integer vector of [8 x i8] containing the subtrahends.			/// A 64-bit integer vector of [8 x i8] containing the subtrahends.
	/// \returns A 64-bit integer vector of [8 x i8] containing the saturated			/// \returns A 64-bit integer vector of [8 x i8] containing the saturated
	/// differences of both parameters.			/// differences of both parameters.
	static __inline__ __m64 __DEFAULT_FN_ATTRS			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_subs_pu8(__m64 __m1, __m64 __m2)			_mm_subs_pu8(__m64 __m1, __m64 __m2)
	{			{
	return (__m64)__builtin_ia32_psubusb((__v8qi)__m1, (__v8qi)__m2);			return __trunc64(__builtin_ia32_psubusb128((__v16qi)__anyext128(__m1),
				(__v16qi)__anyext128(__m2)));
	}			}

	/// Subtracts each 16-bit unsigned integer element of the second 64-bit			/// Subtracts each 16-bit unsigned integer element of the second 64-bit
	/// integer vector of [4 x i16] from the corresponding 16-bit unsigned			/// integer vector of [4 x i16] from the corresponding 16-bit unsigned
	/// integer element of the first 64-bit integer vector of [4 x i16].			/// integer element of the first 64-bit integer vector of [4 x i16].
	///			///
	/// If an element of the first vector is less than the corresponding element			/// If an element of the first vector is less than the corresponding element
	/// of the second vector, the result is saturated to 0. The results are			/// of the second vector, the result is saturated to 0. The results are
	/// packed into a 64-bit integer vector of [4 x i16].			/// packed into a 64-bit integer vector of [4 x i16].
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> PSUBUSW </c> instruction.			/// This intrinsic corresponds to the <c> PSUBUSW </c> instruction.
	///			///
	/// \param __m1			/// \param __m1
	/// A 64-bit integer vector of [4 x i16] containing the minuends.			/// A 64-bit integer vector of [4 x i16] containing the minuends.
	/// \param __m2			/// \param __m2
	/// A 64-bit integer vector of [4 x i16] containing the subtrahends.			/// A 64-bit integer vector of [4 x i16] containing the subtrahends.
	/// \returns A 64-bit integer vector of [4 x i16] containing the saturated			/// \returns A 64-bit integer vector of [4 x i16] containing the saturated
	/// differences of both parameters.			/// differences of both parameters.
	static __inline__ __m64 __DEFAULT_FN_ATTRS			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_subs_pu16(__m64 __m1, __m64 __m2)			_mm_subs_pu16(__m64 __m1, __m64 __m2)
	{			{
	return (__m64)__builtin_ia32_psubusw((__v4hi)__m1, (__v4hi)__m2);			return __trunc64(__builtin_ia32_psubusw128((__v8hi)__anyext128(__m1),
				(__v8hi)__anyext128(__m2)));
	}			}
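The saturating forms above route through the 128-bit builtins using `__trunc64(...)` and `__anyext128(...)`. Those helpers are not defined in this part of the diff, so the following is only a sketch of the apparent intent: widen to 128 bits with unspecified upper lanes, apply a lane-wise operation, and keep the low 64 bits. All types and function names below are hypothetical, and the `filler` parameter exists purely to show that the upper lanes do not influence the result.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a 128-bit vector of 16 bytes. */
typedef struct { uint8_t b[16]; } vec128_model;

/* Widen: low 8 bytes come from m, upper 8 bytes are arbitrary (filler). */
static vec128_model anyext128_model(uint64_t m, uint8_t filler)
{
  vec128_model v;
  memset(v.b, filler, sizeof v.b);
  memcpy(v.b, &m, 8);
  return v;
}

/* Truncate: keep only the low 8 bytes. */
static uint64_t trunc64_model(vec128_model v)
{
  uint64_t m;
  memcpy(&m, v.b, 8);
  return m;
}

/* Lane-wise unsigned saturating subtraction on all 16 bytes. */
static vec128_model psubusb128_model(vec128_model a, vec128_model b)
{
  vec128_model r;
  for (int i = 0; i < 16; ++i)
    r.b[i] = (a.b[i] > b.b[i]) ? (uint8_t)(a.b[i] - b.b[i]) : 0;
  return r;
}

static uint64_t subs_pu8_model(uint64_t m1, uint64_t m2, uint8_t filler)
{
  return trunc64_model(psubusb128_model(anyext128_model(m1, filler),
                                        anyext128_model(m2, filler)));
}
```

Because the operation is lane-wise, no upper lane can feed into a lower one, which is why leaving the upper half unspecified is harmless in this model.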

	/// Multiplies each 16-bit signed integer element of the first 64-bit			/// Multiplies each 16-bit signed integer element of the first 64-bit
	/// integer vector of [4 x i16] by the corresponding 16-bit signed integer			/// integer vector of [4 x i16] by the corresponding 16-bit signed integer
	/// element of the second 64-bit integer vector of [4 x i16] and get four			/// element of the second 64-bit integer vector of [4 x i16] and get four
	/// 32-bit products. Adds adjacent pairs of products to get two 32-bit sums.			/// 32-bit products. Adds adjacent pairs of products to get two 32-bit sums.
	/// The lower 32 bits of these two sums are packed into a 64-bit integer			/// The lower 32 bits of these two sums are packed into a 64-bit integer
	/// vector of [2 x i32].			/// vector of [2 x i32].
	///			///
	/// For example, bits [15:0] of both parameters are multiplied, bits [31:16]			/// For example, bits [15:0] of both parameters are multiplied, bits [31:16]
	/// of both parameters are multiplied, and the sum of both results is written			/// of both parameters are multiplied, and the sum of both results is written
	/// to bits [31:0] of the result.			/// to bits [31:0] of the result.
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> PMADDWD </c> instruction.			/// This intrinsic corresponds to the <c> PMADDWD </c> instruction.
	///			///
	/// \param __m1			/// \param __m1
	/// A 64-bit integer vector of [4 x i16].			/// A 64-bit integer vector of [4 x i16].
	/// \param __m2			/// \param __m2
	/// A 64-bit integer vector of [4 x i16].			/// A 64-bit integer vector of [4 x i16].
	/// \returns A 64-bit integer vector of [2 x i32] containing the sums of			/// \returns A 64-bit integer vector of [2 x i32] containing the sums of
	/// products of both parameters.			/// products of both parameters.
	static __inline__ __m64 __DEFAULT_FN_ATTRS			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_madd_pi16(__m64 __m1, __m64 __m2)			_mm_madd_pi16(__m64 __m1, __m64 __m2)
	{			{
	return (__m64)__builtin_ia32_pmaddwd((__v4hi)__m1, (__v4hi)__m2);			return __trunc64(__builtin_ia32_pmaddwd128((__v8hi)__anyext128(__m1),
				(__v8hi)__anyext128(__m2)));
	}			}
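The multiply-add description above can be checked against a small scalar model. This is an illustrative sketch only, with a `uint64_t` standing in for `__m64` and the hypothetical name `madd_pi16_model`. It assumes the usual two's-complement conversion from `uint16_t` to `int16_t`.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of a PMADDWD-style operation on 4 x i16. */
static uint64_t madd_pi16_model(uint64_t m1, uint64_t m2)
{
  uint64_t result = 0;
  for (int i = 0; i < 2; ++i) {
    /* Extract the two adjacent 16-bit lanes feeding 32-bit result lane i. */
    int16_t a0 = (int16_t)(uint16_t)(m1 >> (32 * i));
    int16_t a1 = (int16_t)(uint16_t)(m1 >> (32 * i + 16));
    int16_t b0 = (int16_t)(uint16_t)(m2 >> (32 * i));
    int16_t b1 = (int16_t)(uint16_t)(m2 >> (32 * i + 16));
    /* Each signed product fits in 32 bits; the sum is taken in unsigned
       arithmetic so only its lower 32 bits are kept, without overflow UB. */
    uint32_t p0 = (uint32_t)((int32_t)a0 * (int32_t)b0);
    uint32_t p1 = (uint32_t)((int32_t)a1 * (int32_t)b1);
    uint32_t sum = p0 + p1;
    result |= (uint64_t)sum << (32 * i);
  }
  return result;
}
```

With lanes `[1, 2, 3, 4]` and `[5, 6, 7, 8]` (lowest lane first), the low result lane is 1*5 + 2*6 = 17 and the high lane is 3*7 + 4*8 = 53.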

	/// Multiplies each 16-bit signed integer element of the first 64-bit			/// Multiplies each 16-bit signed integer element of the first 64-bit
	/// integer vector of [4 x i16] by the corresponding 16-bit signed integer			/// integer vector of [4 x i16] by the corresponding 16-bit signed integer
	/// element of the second 64-bit integer vector of [4 x i16]. Packs the upper			/// element of the second 64-bit integer vector of [4 x i16]. Packs the upper
	/// 16 bits of the 32-bit products into a 64-bit integer vector of [4 x i16].			/// 16 bits of the 32-bit products into a 64-bit integer vector of [4 x i16].
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> PMULHW </c> instruction.			/// This intrinsic corresponds to the <c> PMULHW </c> instruction.
	///			///
	/// \param __m1			/// \param __m1
	/// A 64-bit integer vector of [4 x i16].			/// A 64-bit integer vector of [4 x i16].
	/// \param __m2			/// \param __m2
	/// A 64-bit integer vector of [4 x i16].			/// A 64-bit integer vector of [4 x i16].
	/// \returns A 64-bit integer vector of [4 x i16] containing the upper 16 bits			/// \returns A 64-bit integer vector of [4 x i16] containing the upper 16 bits
	/// of the products of both parameters.			/// of the products of both parameters.
	static __inline__ __m64 __DEFAULT_FN_ATTRS			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_mulhi_pi16(__m64 __m1, __m64 __m2)			_mm_mulhi_pi16(__m64 __m1, __m64 __m2)
	{			{
	return (__m64)__builtin_ia32_pmulhw((__v4hi)__m1, (__v4hi)__m2);			return __trunc64(__builtin_ia32_pmulhw128((__v8hi)__anyext128(__m1),
				(__v8hi)__anyext128(__m2)));
	}			}

	/// Multiplies each 16-bit signed integer element of the first 64-bit			/// Multiplies each 16-bit signed integer element of the first 64-bit
	/// integer vector of [4 x i16] by the corresponding 16-bit signed integer			/// integer vector of [4 x i16] by the corresponding 16-bit signed integer
	/// element of the second 64-bit integer vector of [4 x i16]. Packs the lower			/// element of the second 64-bit integer vector of [4 x i16]. Packs the lower
	/// 16 bits of the 32-bit products into a 64-bit integer vector of [4 x i16].			/// 16 bits of the 32-bit products into a 64-bit integer vector of [4 x i16].
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> PMULLW </c> instruction.			/// This intrinsic corresponds to the <c> PMULLW </c> instruction.
	///			///
	/// \param __m1			/// \param __m1
	/// A 64-bit integer vector of [4 x i16].			/// A 64-bit integer vector of [4 x i16].
	/// \param __m2			/// \param __m2
	/// A 64-bit integer vector of [4 x i16].			/// A 64-bit integer vector of [4 x i16].
	/// \returns A 64-bit integer vector of [4 x i16] containing the lower 16 bits			/// \returns A 64-bit integer vector of [4 x i16] containing the lower 16 bits
	/// of the products of both parameters.			/// of the products of both parameters.
	static __inline__ __m64 __DEFAULT_FN_ATTRS			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_mullo_pi16(__m64 __m1, __m64 __m2)			_mm_mullo_pi16(__m64 __m1, __m64 __m2)
	{			{
	return (__m64)__builtin_ia32_pmullw((__v4hi)__m1, (__v4hi)__m2);			return (__m64)(((__v4hu)__m1) * ((__v4hu)__m2));
	}			}
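In the right-hand column, `_mm_mullo_pi16` uses an unsigned element-wise multiply while `_mm_mulhi_pi16` keeps the signed builtin. The lower 16 bits of a product do not depend on signedness, but the upper 16 bits do. The per-lane sketch below illustrates this under the assumption of a 32-bit `int`. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Upper 16 bits of the signed 32-bit product. */
static uint16_t mulhi_i16_model(int16_t a, int16_t b)
{
  int32_t p = (int32_t)a * (int32_t)b;
  return (uint16_t)((uint32_t)p >> 16);
}

/* Lower 16 bits of the signed 32-bit product. */
static uint16_t mullo_i16_model(int16_t a, int16_t b)
{
  int32_t p = (int32_t)a * (int32_t)b;
  return (uint16_t)(uint32_t)p;
}

/* Lower 16 bits of the unsigned product. Widen first: uint16_t operands
   would otherwise promote to int, and 0xFFFF * 0xFFFF overflows it. */
static uint16_t mullo_u16_model(uint16_t a, uint16_t b)
{
  return (uint16_t)((uint32_t)a * (uint32_t)b);
}
```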

	/// Left-shifts each 16-bit signed integer element of the first			/// Left-shifts each 16-bit signed integer element of the first
	/// parameter, which is a 64-bit integer vector of [4 x i16], by the number			/// parameter, which is a 64-bit integer vector of [4 x i16], by the number
	/// of bits specified by the second parameter, which is a 64-bit integer. The			/// of bits specified by the second parameter, which is a 64-bit integer. The
	/// lower 16 bits of the results are packed into a 64-bit integer vector of			/// lower 16 bits of the results are packed into a 64-bit integer vector of
	/// [4 x i16].			/// [4 x i16].
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> PSLLW </c> instruction.			/// This intrinsic corresponds to the <c> PSLLW </c> instruction.
	///			///
	/// \param __m			/// \param __m
	/// A 64-bit integer vector of [4 x i16].			/// A 64-bit integer vector of [4 x i16].
	/// \param __count			/// \param __count
	/// A 64-bit integer vector interpreted as a single 64-bit integer.			/// A 64-bit integer vector interpreted as a single 64-bit integer.
	/// \returns A 64-bit integer vector of [4 x i16] containing the left-shifted			/// \returns A 64-bit integer vector of [4 x i16] containing the left-shifted
	/// values. If \a __count is greater than or equal to 16, the result is set to			/// values. If \a __count is greater than or equal to 16, the result is set to
	/// all 0.			/// all 0.
	static __inline__ __m64 __DEFAULT_FN_ATTRS			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_sll_pi16(__m64 __m, __m64 __count)			_mm_sll_pi16(__m64 __m, __m64 __count)
	{			{
	return (__m64)__builtin_ia32_psllw((__v4hi)__m, __count);			return __trunc64(__builtin_ia32_psllw128((__v8hi)__anyext128(__m),
				(__v8hi)__anyext128(__count)));
	}			}
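The shift count here is a full 64-bit value, and any count of 16 or more clears every lane. A scalar sketch of that rule follows, with a `uint64_t` standing in for `__m64` and the hypothetical name `sll_pi16_model`. Note that the count is compared as a 64-bit quantity, not truncated to 32 bits first.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of a PSLLW-style shift on 4 x i16. */
static uint64_t sll_pi16_model(uint64_t m, uint64_t count)
{
  if (count >= 16)
    return 0;                       /* out-of-range count: all lanes zero */
  uint64_t result = 0;
  for (int i = 0; i < 4; ++i) {
    uint32_t v = (uint16_t)(m >> (16 * i));
    /* Shift in 32 bits, then keep only the lower 16 bits of the lane. */
    uint16_t shifted = (uint16_t)(v << count);
    result |= (uint64_t)shifted << (16 * i);
  }
  return result;
}
```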

	/// Left-shifts each 16-bit signed integer element of a 64-bit integer			/// Left-shifts each 16-bit signed integer element of a 64-bit integer
	/// vector of [4 x i16] by the number of bits specified by a 32-bit integer.			/// vector of [4 x i16] by the number of bits specified by a 32-bit integer.
	/// The lower 16 bits of the results are packed into a 64-bit integer vector			/// The lower 16 bits of the results are packed into a 64-bit integer vector
	/// of [4 x i16].			/// of [4 x i16].
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> PSLLW </c> instruction.			/// This intrinsic corresponds to the <c> PSLLW </c> instruction.
	///			///
	/// \param __m			/// \param __m
	/// A 64-bit integer vector of [4 x i16].			/// A 64-bit integer vector of [4 x i16].
	/// \param __count			/// \param __count
	/// A 32-bit integer value.			/// A 32-bit integer value.
	/// \returns A 64-bit integer vector of [4 x i16] containing the left-shifted			/// \returns A 64-bit integer vector of [4 x i16] containing the left-shifted
	/// values. If \a __count is greater than or equal to 16, the result is set to			/// values. If \a __count is greater than or equal to 16, the result is set to
	/// all 0.			/// all 0.
	static __inline__ __m64 __DEFAULT_FN_ATTRS			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_slli_pi16(__m64 __m, int __count)			_mm_slli_pi16(__m64 __m, int __count)
	{			{
	return (__m64)__builtin_ia32_psllwi((__v4hi)__m, __count);			return __trunc64(__builtin_ia32_psllwi128((__v8hi)__anyext128(__m),
				__count));
	}			}

	/// Left-shifts each 32-bit signed integer element of the first			/// Left-shifts each 32-bit signed integer element of the first
	/// parameter, which is a 64-bit integer vector of [2 x i32], by the number			/// parameter, which is a 64-bit integer vector of [2 x i32], by the number
	/// of bits specified by the second parameter, which is a 64-bit integer. The			/// of bits specified by the second parameter, which is a 64-bit integer. The
	/// lower 32 bits of the results are packed into a 64-bit integer vector of			/// lower 32 bits of the results are packed into a 64-bit integer vector of
	/// [2 x i32].			/// [2 x i32].
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> PSLLD </c> instruction.			/// This intrinsic corresponds to the <c> PSLLD </c> instruction.
	///			///
	/// \param __m			/// \param __m
	/// A 64-bit integer vector of [2 x i32].			/// A 64-bit integer vector of [2 x i32].
	/// \param __count			/// \param __count
	/// A 64-bit integer vector interpreted as a single 64-bit integer.			/// A 64-bit integer vector interpreted as a single 64-bit integer.
	/// \returns A 64-bit integer vector of [2 x i32] containing the left-shifted			/// \returns A 64-bit integer vector of [2 x i32] containing the left-shifted
	/// values. If \a __count is greater than or equal to 32, the result is set to			/// values. If \a __count is greater than or equal to 32, the result is set to
	/// all 0.			/// all 0.
	static __inline__ __m64 __DEFAULT_FN_ATTRS			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_sll_pi32(__m64 __m, __m64 __count)			_mm_sll_pi32(__m64 __m, __m64 __count)
	{			{
	return (__m64)__builtin_ia32_pslld((__v2si)__m, __count);			return __trunc64(__builtin_ia32_pslld128((__v4si)__anyext128(__m),
				(__v4si)__anyext128(__count)));
	}			}

	/// Left-shifts each 32-bit signed integer element of a 64-bit integer			/// Left-shifts each 32-bit signed integer element of a 64-bit integer
	/// vector of [2 x i32] by the number of bits specified by a 32-bit integer.			/// vector of [2 x i32] by the number of bits specified by a 32-bit integer.
	/// The lower 32 bits of the results are packed into a 64-bit integer vector			/// The lower 32 bits of the results are packed into a 64-bit integer vector
	/// of [2 x i32].			/// of [2 x i32].
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> PSLLD </c> instruction.			/// This intrinsic corresponds to the <c> PSLLD </c> instruction.
	///			///
	/// \param __m			/// \param __m
	/// A 64-bit integer vector of [2 x i32].			/// A 64-bit integer vector of [2 x i32].
	/// \param __count			/// \param __count
	/// A 32-bit integer value.			/// A 32-bit integer value.
	/// \returns A 64-bit integer vector of [2 x i32] containing the left-shifted			/// \returns A 64-bit integer vector of [2 x i32] containing the left-shifted
	/// values. If \a __count is greater than or equal to 32, the result is set to			/// values. If \a __count is greater than or equal to 32, the result is set to
	/// all 0.			/// all 0.
	static __inline__ __m64 __DEFAULT_FN_ATTRS			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_slli_pi32(__m64 __m, int __count)			_mm_slli_pi32(__m64 __m, int __count)
	{			{
	return (__m64)__builtin_ia32_pslldi((__v2si)__m, __count);			return __trunc64(__builtin_ia32_pslldi128((__v4si)__anyext128(__m),
				__count));
	}			}

	/// Left-shifts the first 64-bit integer parameter by the number of bits			/// Left-shifts the first 64-bit integer parameter by the number of bits
	/// specified by the second 64-bit integer parameter. The lower 64 bits of			/// specified by the second 64-bit integer parameter. The lower 64 bits of
	/// the result are returned.			/// the result are returned.
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> PSLLQ </c> instruction.			/// This intrinsic corresponds to the <c> PSLLQ </c> instruction.
	///			///
	/// \param __m			/// \param __m
	/// A 64-bit integer vector interpreted as a single 64-bit integer.			/// A 64-bit integer vector interpreted as a single 64-bit integer.
	/// \param __count			/// \param __count
	/// A 64-bit integer vector interpreted as a single 64-bit integer.			/// A 64-bit integer vector interpreted as a single 64-bit integer.
	/// \returns A 64-bit integer vector containing the left-shifted value. If			/// \returns A 64-bit integer vector containing the left-shifted value. If
	/// \a __count is greater than or equal to 64, the result is set to 0.			/// \a __count is greater than or equal to 64, the result is set to 0.
	static __inline__ __m64 __DEFAULT_FN_ATTRS			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_sll_si64(__m64 __m, __m64 __count)			_mm_sll_si64(__m64 __m, __m64 __count)
	{			{
	return (__m64)__builtin_ia32_psllq((__v1di)__m, __count);			return __trunc64(__builtin_ia32_psllq128((__v2di)__anyext128(__m),
				__anyext128(__count)));
	}			}

	/// Left-shifts the first parameter, which is a 64-bit integer, by the			/// Left-shifts the first parameter, which is a 64-bit integer, by the
	/// number of bits specified by the second parameter, which is a 32-bit			/// number of bits specified by the second parameter, which is a 32-bit
	/// integer. The lower 64 bits of the result are returned.			/// integer. The lower 64 bits of the result are returned.
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> PSLLQ </c> instruction.			/// This intrinsic corresponds to the <c> PSLLQ </c> instruction.
	///			///
	/// \param __m			/// \param __m
	/// A 64-bit integer vector interpreted as a single 64-bit integer.			/// A 64-bit integer vector interpreted as a single 64-bit integer.
	/// \param __count			/// \param __count
	/// A 32-bit integer value.			/// A 32-bit integer value.
	/// \returns A 64-bit integer vector containing the left-shifted value. If			/// \returns A 64-bit integer vector containing the left-shifted value. If
	/// \a __count is greater than or equal to 64, the result is set to 0.			/// \a __count is greater than or equal to 64, the result is set to 0.
	static __inline__ __m64 __DEFAULT_FN_ATTRS			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_slli_si64(__m64 __m, int __count)			_mm_slli_si64(__m64 __m, int __count)
	{			{
	return (__m64)__builtin_ia32_psllqi((__v1di)__m, __count);			return __trunc64(__builtin_ia32_psllqi128((__v2di)__anyext128(__m),
				__count));
	}			}
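The documented behavior for the 64-bit shifts (a count of 64 or more yields 0) differs from the C `<<` operator, where shifting a 64-bit value by 64 or more is undefined. A scalar model therefore needs an explicit range check. The sketch below is illustrative only, and `sll_si64_model` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of a PSLLQ-style shift on one 64-bit lane. */
static uint64_t sll_si64_model(uint64_t m, uint64_t count)
{
  /* The guard is required: m << count is undefined in C for count >= 64. */
  return (count >= 64) ? 0 : (m << count);
}
```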

	/// Right-shifts each 16-bit integer element of the first parameter,			/// Right-shifts each 16-bit integer element of the first parameter,
	/// which is a 64-bit integer vector of [4 x i16], by the number of bits			/// which is a 64-bit integer vector of [4 x i16], by the number of bits
	/// specified by the second parameter, which is a 64-bit integer.			/// specified by the second parameter, which is a 64-bit integer.
	///			///
	/// High-order bits are filled with the sign bit of the initial value of each			/// High-order bits are filled with the sign bit of the initial value of each
	/// 16-bit element. The 16-bit results are packed into a 64-bit integer			/// 16-bit element. The 16-bit results are packed into a 64-bit integer
	/// vector of [4 x i16].			/// vector of [4 x i16].
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> PSRAW </c> instruction.			/// This intrinsic corresponds to the <c> PSRAW </c> instruction.
	///			///
	/// \param __m			/// \param __m
	/// A 64-bit integer vector of [4 x i16].			/// A 64-bit integer vector of [4 x i16].
	/// \param __count			/// \param __count
	/// A 64-bit integer vector interpreted as a single 64-bit integer.			/// A 64-bit integer vector interpreted as a single 64-bit integer.
	/// \returns A 64-bit integer vector of [4 x i16] containing the right-shifted			/// \returns A 64-bit integer vector of [4 x i16] containing the right-shifted
	/// values.			/// values.
	static __inline__ __m64 __DEFAULT_FN_ATTRS			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_sra_pi16(__m64 __m, __m64 __count)			_mm_sra_pi16(__m64 __m, __m64 __count)
	{			{
	return (__m64)__builtin_ia32_psraw((__v4hi)__m, __count);			return __trunc64(__builtin_ia32_psraw128((__v8hi)__anyext128(__m),
				(__v8hi)__anyext128(__count)));
	}			}

	/// Right-shifts each 16-bit integer element of a 64-bit integer vector			/// Right-shifts each 16-bit integer element of a 64-bit integer vector
	/// of [4 x i16] by the number of bits specified by a 32-bit integer.			/// of [4 x i16] by the number of bits specified by a 32-bit integer.
	///			///
	/// High-order bits are filled with the sign bit of the initial value of each			/// High-order bits are filled with the sign bit of the initial value of each
	/// 16-bit element. The 16-bit results are packed into a 64-bit integer			/// 16-bit element. The 16-bit results are packed into a 64-bit integer
	/// vector of [4 x i16].			/// vector of [4 x i16].
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> PSRAW </c> instruction.			/// This intrinsic corresponds to the <c> PSRAW </c> instruction.
	///			///
	/// \param __m			/// \param __m
	/// A 64-bit integer vector of [4 x i16].			/// A 64-bit integer vector of [4 x i16].
	/// \param __count			/// \param __count
	/// A 32-bit integer value.			/// A 32-bit integer value.
	/// \returns A 64-bit integer vector of [4 x i16] containing the right-shifted			/// \returns A 64-bit integer vector of [4 x i16] containing the right-shifted
	/// values.			/// values.
	static __inline__ __m64 __DEFAULT_FN_ATTRS			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_srai_pi16(__m64 __m, int __count)			_mm_srai_pi16(__m64 __m, int __count)
	{			{
	return (__m64)__builtin_ia32_psrawi((__v4hi)__m, __count);			return __trunc64(__builtin_ia32_psrawi128((__v8hi)__anyext128(__m),
				__count));
	}			}
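For the arithmetic right shifts, the vacated high-order bits take the sign bit of each element. On the underlying `PSRAW` instruction a count above 15 fills the whole element with the sign bit, which matches a shift by exactly 15. The per-lane sketch below models this without relying on implementation-defined right shifts of negative values. `sra_i16_model` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-lane model of a PSRAW-style arithmetic right shift. */
static int16_t sra_i16_model(int16_t v, uint64_t count)
{
  /* Counts above 15 behave like 15: the lane becomes all sign bits. */
  int n = (count > 15) ? 15 : (int)count;
  int32_t x = v;
  /* For negative x, ~x is non-negative, so its shift is well defined;
     complementing again gives the sign-filled (floor) result. */
  int32_t r = (x < 0) ? ~(~x >> n) : (x >> n);
  return (int16_t)r;
}
```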

	/// Right-shifts each 32-bit integer element of the first parameter,			/// Right-shifts each 32-bit integer element of the first parameter,
	/// which is a 64-bit integer vector of [2 x i32], by the number of bits			/// which is a 64-bit integer vector of [2 x i32], by the number of bits
	/// specified by the second parameter, which is a 64-bit integer.			/// specified by the second parameter, which is a 64-bit integer.
	///			///
	/// High-order bits are filled with the sign bit of the initial value of each			/// High-order bits are filled with the sign bit of the initial value of each
	/// 32-bit element. The 32-bit results are packed into a 64-bit integer			/// 32-bit element. The 32-bit results are packed into a 64-bit integer
	/// vector of [2 x i32].			/// vector of [2 x i32].
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> PSRAD </c> instruction.			/// This intrinsic corresponds to the <c> PSRAD </c> instruction.
	///			///
	/// \param __m			/// \param __m
	/// A 64-bit integer vector of [2 x i32].			/// A 64-bit integer vector of [2 x i32].
	/// \param __count			/// \param __count
	/// A 64-bit integer vector interpreted as a single 64-bit integer.			/// A 64-bit integer vector interpreted as a single 64-bit integer.
	/// \returns A 64-bit integer vector of [2 x i32] containing the right-shifted			/// \returns A 64-bit integer vector of [2 x i32] containing the right-shifted
	/// values.			/// values.
	static __inline__ __m64 __DEFAULT_FN_ATTRS			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_sra_pi32(__m64 __m, __m64 __count)			_mm_sra_pi32(__m64 __m, __m64 __count)
	{			{
	return (__m64)__builtin_ia32_psrad((__v2si)__m, __count);			return __trunc64(__builtin_ia32_psrad128((__v4si)__anyext128(__m),
				(__v4si)__anyext128(__count)));
	}			}

	/// Right-shifts each 32-bit integer element of a 64-bit integer vector			/// Right-shifts each 32-bit integer element of a 64-bit integer vector
	/// of [2 x i32] by the number of bits specified by a 32-bit integer.			/// of [2 x i32] by the number of bits specified by a 32-bit integer.
	///			///
	/// High-order bits are filled with the sign bit of the initial value of each			/// High-order bits are filled with the sign bit of the initial value of each
	/// 32-bit element. The 32-bit results are packed into a 64-bit integer			/// 32-bit element. The 32-bit results are packed into a 64-bit integer
	/// vector of [2 x i32].			/// vector of [2 x i32].
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> PSRAD </c> instruction.			/// This intrinsic corresponds to the <c> PSRAD </c> instruction.
	///			///
	/// \param __m			/// \param __m
	/// A 64-bit integer vector of [2 x i32].			/// A 64-bit integer vector of [2 x i32].
	/// \param __count			/// \param __count
	/// A 32-bit integer value.			/// A 32-bit integer value.
	/// \returns A 64-bit integer vector of [2 x i32] containing the right-shifted			/// \returns A 64-bit integer vector of [2 x i32] containing the right-shifted
	/// values.			/// values.
	static __inline__ __m64 __DEFAULT_FN_ATTRS			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_srai_pi32(__m64 __m, int __count)			_mm_srai_pi32(__m64 __m, int __count)
	{			{
	return (__m64)__builtin_ia32_psradi((__v2si)__m, __count);			return __trunc64(__builtin_ia32_psradi128((__v4si)__anyext128(__m),
				__count));
	}			}

	/// Right-shifts each 16-bit integer element of the first parameter,			/// Right-shifts each 16-bit integer element of the first parameter,
	/// which is a 64-bit integer vector of [4 x i16], by the number of bits			/// which is a 64-bit integer vector of [4 x i16], by the number of bits
	/// specified by the second parameter, which is a 64-bit integer.			/// specified by the second parameter, which is a 64-bit integer.
	///			///
	/// High-order bits are cleared. The 16-bit results are packed into a 64-bit			/// High-order bits are cleared. The 16-bit results are packed into a 64-bit
	/// integer vector of [4 x i16].			/// integer vector of [4 x i16].
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> PSRLW </c> instruction.			/// This intrinsic corresponds to the <c> PSRLW </c> instruction.
	///			///
	/// \param __m			/// \param __m
	/// A 64-bit integer vector of [4 x i16].			/// A 64-bit integer vector of [4 x i16].
	/// \param __count			/// \param __count
	/// A 64-bit integer vector interpreted as a single 64-bit integer.			/// A 64-bit integer vector interpreted as a single 64-bit integer.
	/// \returns A 64-bit integer vector of [4 x i16] containing the right-shifted			/// \returns A 64-bit integer vector of [4 x i16] containing the right-shifted
	/// values.			/// values.
	static __inline__ __m64 __DEFAULT_FN_ATTRS			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_srl_pi16(__m64 __m, __m64 __count)			_mm_srl_pi16(__m64 __m, __m64 __count)
	{			{
	return (__m64)__builtin_ia32_psrlw((__v4hi)__m, __count);			return __trunc64(__builtin_ia32_psrlw128((__v8hi)__anyext128(__m),
				(__v8hi)__anyext128(__count)));
	}			}
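For the logical shifts, a scalar sketch may help show what "high-order bits are cleared" means per element. `model_srl_pi16` is a hypothetical name, and the model assumes the PSRLW rule that a count above 15 clears every element. As far as I understand the SSE2 form, the instruction reads its count from the low 64 bits of the count operand, which would explain why widening `__count` with `__anyext128` is harmless here.

```c
#include <stdint.h>

/* Hypothetical scalar model of PSRLW applied to a 64-bit value
   that holds a vector of [4 x i16]. */
static uint64_t model_srl_pi16(uint64_t m, uint64_t count)
{
    uint64_t result = 0;
    int i;

    /* A count above 15 clears every 16-bit element. */
    if (count > 15)
        return 0;

    for (i = 0; i < 4; ++i) {
        uint64_t lane = (m >> (16 * i)) & 0xFFFFu;
        /* Logical shift: vacated high-order bits are zero, and bits
           never cross into the neighboring element. */
        result |= (lane >> count) << (16 * i);
    }
    return result;
}
```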

	/// Right-shifts each 16-bit integer element of a 64-bit integer vector			/// Right-shifts each 16-bit integer element of a 64-bit integer vector
	/// of [4 x i16] by the number of bits specified by a 32-bit integer.			/// of [4 x i16] by the number of bits specified by a 32-bit integer.
	///			///
	/// High-order bits are cleared. The 16-bit results are packed into a 64-bit			/// High-order bits are cleared. The 16-bit results are packed into a 64-bit
	/// integer vector of [4 x i16].			/// integer vector of [4 x i16].
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> PSRLW </c> instruction.			/// This intrinsic corresponds to the <c> PSRLW </c> instruction.
	///			///
	/// \param __m			/// \param __m
	/// A 64-bit integer vector of [4 x i16].			/// A 64-bit integer vector of [4 x i16].
	/// \param __count			/// \param __count
	/// A 32-bit integer value.			/// A 32-bit integer value.
	/// \returns A 64-bit integer vector of [4 x i16] containing the right-shifted			/// \returns A 64-bit integer vector of [4 x i16] containing the right-shifted
	/// values.			/// values.
	static __inline__ __m64 __DEFAULT_FN_ATTRS			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_srli_pi16(__m64 __m, int __count)			_mm_srli_pi16(__m64 __m, int __count)
	{			{
	return (__m64)__builtin_ia32_psrlwi((__v4hi)__m, __count);			return __trunc64(__builtin_ia32_psrlwi128((__v8hi)__anyext128(__m),
				__count));
	}			}

	/// Right-shifts each 32-bit integer element of the first parameter,			/// Right-shifts each 32-bit integer element of the first parameter,
	/// which is a 64-bit integer vector of [2 x i32], by the number of bits			/// which is a 64-bit integer vector of [2 x i32], by the number of bits
	/// specified by the second parameter, which is a 64-bit integer.			/// specified by the second parameter, which is a 64-bit integer.
	///			///
	/// High-order bits are cleared. The 32-bit results are packed into a 64-bit			/// High-order bits are cleared. The 32-bit results are packed into a 64-bit
	/// integer vector of [2 x i32].			/// integer vector of [2 x i32].
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> PSRLD </c> instruction.			/// This intrinsic corresponds to the <c> PSRLD </c> instruction.
	///			///
	/// \param __m			/// \param __m
	/// A 64-bit integer vector of [2 x i32].			/// A 64-bit integer vector of [2 x i32].
	/// \param __count			/// \param __count
	/// A 64-bit integer vector interpreted as a single 64-bit integer.			/// A 64-bit integer vector interpreted as a single 64-bit integer.
	/// \returns A 64-bit integer vector of [2 x i32] containing the right-shifted			/// \returns A 64-bit integer vector of [2 x i32] containing the right-shifted
	/// values.			/// values.
	static __inline__ __m64 __DEFAULT_FN_ATTRS			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_srl_pi32(__m64 __m, __m64 __count)			_mm_srl_pi32(__m64 __m, __m64 __count)
	{			{
	return (__m64)__builtin_ia32_psrld((__v2si)__m, __count);			return __trunc64(__builtin_ia32_psrld128((__v4si)__anyext128(__m),
				(__v4si)__anyext128(__count)));
	}			}

	/// Right-shifts each 32-bit integer element of a 64-bit integer vector			/// Right-shifts each 32-bit integer element of a 64-bit integer vector
	/// of [2 x i32] by the number of bits specified by a 32-bit integer.			/// of [2 x i32] by the number of bits specified by a 32-bit integer.
	///			///
	/// High-order bits are cleared. The 32-bit results are packed into a 64-bit			/// High-order bits are cleared. The 32-bit results are packed into a 64-bit
	/// integer vector of [2 x i32].			/// integer vector of [2 x i32].
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> PSRLD </c> instruction.			/// This intrinsic corresponds to the <c> PSRLD </c> instruction.
	///			///
	/// \param __m			/// \param __m
	/// A 64-bit integer vector of [2 x i32].			/// A 64-bit integer vector of [2 x i32].
	/// \param __count			/// \param __count
	/// A 32-bit integer value.			/// A 32-bit integer value.
	/// \returns A 64-bit integer vector of [2 x i32] containing the right-shifted			/// \returns A 64-bit integer vector of [2 x i32] containing the right-shifted
	/// values.			/// values.
	static __inline__ __m64 __DEFAULT_FN_ATTRS			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_srli_pi32(__m64 __m, int __count)			_mm_srli_pi32(__m64 __m, int __count)
	{			{
	return (__m64)__builtin_ia32_psrldi((__v2si)__m, __count);			return __trunc64(__builtin_ia32_psrldi128((__v4si)__anyext128(__m),
				__count));
	}			}

	/// Right-shifts the first 64-bit integer parameter by the number of bits			/// Right-shifts the first 64-bit integer parameter by the number of bits
	/// specified by the second 64-bit integer parameter.			/// specified by the second 64-bit integer parameter.
	///			///
	/// High-order bits are cleared.			/// High-order bits are cleared.
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> PSRLQ </c> instruction.			/// This intrinsic corresponds to the <c> PSRLQ </c> instruction.
	///			///
	/// \param __m			/// \param __m
	/// A 64-bit integer vector interpreted as a single 64-bit integer.			/// A 64-bit integer vector interpreted as a single 64-bit integer.
	/// \param __count			/// \param __count
	/// A 64-bit integer vector interpreted as a single 64-bit integer.			/// A 64-bit integer vector interpreted as a single 64-bit integer.
	/// \returns A 64-bit integer vector containing the right-shifted value.			/// \returns A 64-bit integer vector containing the right-shifted value.
	static __inline__ __m64 __DEFAULT_FN_ATTRS			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_srl_si64(__m64 __m, __m64 __count)			_mm_srl_si64(__m64 __m, __m64 __count)
	{			{
	return (__m64)__builtin_ia32_psrlq((__v1di)__m, __count);			return __trunc64(__builtin_ia32_psrlq128((__v2di)__anyext128(__m),
				__anyext128(__count)));
	}			}
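The 64-bit case differs from a plain C `>>` in one respect worth spelling out. The sketch below uses a hypothetical helper, `model_srl_si64`, and assumes the PSRLQ rule that a count above 63 produces zero.

```c
#include <stdint.h>

/* Hypothetical scalar model of PSRLQ on a single 64-bit element. */
static uint64_t model_srl_si64(uint64_t m, uint64_t count)
{
    /* In C, shifting a 64-bit value by 64 or more is undefined behavior,
       so the out-of-range case has to be handled explicitly. */
    return count > 63 ? 0 : m >> count;
}
```

This is one reason the intrinsic keeps calling a builtin instead of using the `>>` operator on the vector type: the instruction defines a result for oversized counts, while the C operator does not.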

	/// Right-shifts the first parameter, which is a 64-bit integer, by the			/// Right-shifts the first parameter, which is a 64-bit integer, by the
	/// number of bits specified by the second parameter, which is a 32-bit			/// number of bits specified by the second parameter, which is a 32-bit
	/// integer.			/// integer.
	///			///
	/// High-order bits are cleared.			/// High-order bits are cleared.
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> PSRLQ </c> instruction.			/// This intrinsic corresponds to the <c> PSRLQ </c> instruction.
	///			///
	/// \param __m			/// \param __m
	/// A 64-bit integer vector interpreted as a single 64-bit integer.			/// A 64-bit integer vector interpreted as a single 64-bit integer.
	/// \param __count			/// \param __count
	/// A 32-bit integer value.			/// A 32-bit integer value.
	/// \returns A 64-bit integer vector containing the right-shifted value.			/// \returns A 64-bit integer vector containing the right-shifted value.
	static __inline__ __m64 __DEFAULT_FN_ATTRS			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_srli_si64(__m64 __m, int __count)			_mm_srli_si64(__m64 __m, int __count)
	{			{
	return (__m64)__builtin_ia32_psrlqi((__v1di)__m, __count);			return __trunc64(__builtin_ia32_psrlqi128((__v2di)__anyext128(__m),
				__count));
	}			}

	/// Performs a bitwise AND of two 64-bit integer vectors.			/// Performs a bitwise AND of two 64-bit integer vectors.
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> PAND </c> instruction.			/// This intrinsic corresponds to the <c> PAND </c> instruction.
	///			///
	/// \param __m1			/// \param __m1
	/// A 64-bit integer vector.			/// A 64-bit integer vector.
	/// \param __m2			/// \param __m2
	/// A 64-bit integer vector.			/// A 64-bit integer vector.
	/// \returns A 64-bit integer vector containing the bitwise AND of both			/// \returns A 64-bit integer vector containing the bitwise AND of both
	/// parameters.			/// parameters.
	static __inline__ __m64 __DEFAULT_FN_ATTRS			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_and_si64(__m64 __m1, __m64 __m2)			_mm_and_si64(__m64 __m1, __m64 __m2)
	{			{
	return __builtin_ia32_pand((__v1di)__m1, (__v1di)__m2);			return (__m64)(((__v1du)__m1) & ((__v1du)__m2));
				craig.topper: I think we probably want to use a v2su or v2si here. Using v1di scalarizes and splits on 32-bit targets. On 64-bit targets it emits GPR code.
				jyknight (author): AFAICT, this doesn't matter? It seems to emit GPR or XMM code just depending on whether the result values are needed as XMM or not, independent of whether the type is specified as v2su or v1du.
	}			}

	/// Performs a bitwise NOT of the first 64-bit integer vector, and then			/// Performs a bitwise NOT of the first 64-bit integer vector, and then
	/// performs a bitwise AND of the intermediate result and the second 64-bit			/// performs a bitwise AND of the intermediate result and the second 64-bit
	/// integer vector.			/// integer vector.
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> PANDN </c> instruction.			/// This intrinsic corresponds to the <c> PANDN </c> instruction.
	///			///
	/// \param __m1			/// \param __m1
	/// A 64-bit integer vector. The one's complement of this parameter is used			/// A 64-bit integer vector. The one's complement of this parameter is used
	/// in the bitwise AND.			/// in the bitwise AND.
	/// \param __m2			/// \param __m2
	/// A 64-bit integer vector.			/// A 64-bit integer vector.
	/// \returns A 64-bit integer vector containing the bitwise AND of the second			/// \returns A 64-bit integer vector containing the bitwise AND of the second
	/// parameter and the one's complement of the first parameter.			/// parameter and the one's complement of the first parameter.
	static __inline__ __m64 __DEFAULT_FN_ATTRS			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_andnot_si64(__m64 __m1, __m64 __m2)			_mm_andnot_si64(__m64 __m1, __m64 __m2)
	{			{
	return __builtin_ia32_pandn((__v1di)__m1, (__v1di)__m2);			return (__m64)(~((__v1du)__m1) & ((__v1du)__m2));
	}			}
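The operand order of `_mm_andnot_si64` is easy to get backwards, so here is a minimal scalar sketch of the documented behavior. `model_andnot_si64` is a hypothetical name; the body mirrors the `~m1 & m2` expression on the right-hand side of the diff.

```c
#include <stdint.h>

/* Hypothetical scalar model of PANDN on a 64-bit value:
   the FIRST operand is complemented, then ANDed with the second. */
static uint64_t model_andnot_si64(uint64_t m1, uint64_t m2)
{
    return ~m1 & m2;
}
```

A typical use is clearing the bits selected by a mask: pass the mask as the first argument and the data as the second.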

	/// Performs a bitwise OR of two 64-bit integer vectors.			/// Performs a bitwise OR of two 64-bit integer vectors.
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> POR </c> instruction.			/// This intrinsic corresponds to the <c> POR </c> instruction.
	///			///
	/// \param __m1			/// \param __m1
	/// A 64-bit integer vector.			/// A 64-bit integer vector.
	/// \param __m2			/// \param __m2
	/// A 64-bit integer vector.			/// A 64-bit integer vector.
	/// \returns A 64-bit integer vector containing the bitwise OR of both			/// \returns A 64-bit integer vector containing the bitwise OR of both
	/// parameters.			/// parameters.
	static __inline__ __m64 __DEFAULT_FN_ATTRS			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_or_si64(__m64 __m1, __m64 __m2)			_mm_or_si64(__m64 __m1, __m64 __m2)
	{			{
	return __builtin_ia32_por((__v1di)__m1, (__v1di)__m2);			return (__m64)(((__v1du)__m1) | ((__v1du)__m2));
	}			}

	/// Performs a bitwise exclusive OR of two 64-bit integer vectors.			/// Performs a bitwise exclusive OR of two 64-bit integer vectors.
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> PXOR </c> instruction.			/// This intrinsic corresponds to the <c> PXOR </c> instruction.
	///			///
	/// \param __m1			/// \param __m1
	/// A 64-bit integer vector.			/// A 64-bit integer vector.
	/// \param __m2			/// \param __m2
	/// A 64-bit integer vector.			/// A 64-bit integer vector.
	/// \returns A 64-bit integer vector containing the bitwise exclusive OR of both			/// \returns A 64-bit integer vector containing the bitwise exclusive OR of both
	/// parameters.			/// parameters.
	static __inline__ __m64 __DEFAULT_FN_ATTRS			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_xor_si64(__m64 __m1, __m64 __m2)			_mm_xor_si64(__m64 __m1, __m64 __m2)
	{			{
	return __builtin_ia32_pxor((__v1di)__m1, (__v1di)__m2);			return (__m64)(((__v1du)__m1) ^ ((__v1du)__m2));
	}			}

	/// Compares the 8-bit integer elements of two 64-bit integer vectors of			/// Compares the 8-bit integer elements of two 64-bit integer vectors of
	/// [8 x i8] to determine if the element of the first vector is equal to the			/// [8 x i8] to determine if the element of the first vector is equal to the
	/// corresponding element of the second vector.			/// corresponding element of the second vector.
	///			///
	/// The comparison yields 0 for false, 0xFF for true.			/// The comparison yields 0 for false, 0xFF for true.
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> PCMPEQB </c> instruction.			/// This intrinsic corresponds to the <c> PCMPEQB </c> instruction.
	///			///
	/// \param __m1			/// \param __m1
	/// A 64-bit integer vector of [8 x i8].			/// A 64-bit integer vector of [8 x i8].
	/// \param __m2			/// \param __m2
	/// A 64-bit integer vector of [8 x i8].			/// A 64-bit integer vector of [8 x i8].
	/// \returns A 64-bit integer vector of [8 x i8] containing the comparison			/// \returns A 64-bit integer vector of [8 x i8] containing the comparison
	/// results.			/// results.
	static __inline__ __m64 __DEFAULT_FN_ATTRS			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_cmpeq_pi8(__m64 __m1, __m64 __m2)			_mm_cmpeq_pi8(__m64 __m1, __m64 __m2)
	{			{
	return (__m64)__builtin_ia32_pcmpeqb((__v8qi)__m1, (__v8qi)__m2);			return (__m64)(((__v8qi)__m1) == ((__v8qi)__m2));
	}			}
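The "0 for false, 0xFF for true" convention can be illustrated with a scalar sketch. `model_cmpeq_pi8` is a hypothetical helper that treats a `uint64_t` as eight byte lanes; it models the documented result and is not how the header computes it.

```c
#include <stdint.h>

/* Hypothetical scalar model of PCMPEQB on a 64-bit value of [8 x i8]. */
static uint64_t model_cmpeq_pi8(uint64_t m1, uint64_t m2)
{
    uint64_t result = 0;
    int i;

    for (i = 0; i < 8; ++i) {
        uint64_t a = (m1 >> (8 * i)) & 0xFFu;
        uint64_t b = (m2 >> (8 * i)) & 0xFFu;
        /* Equal lanes become all ones (0xFF); unequal lanes stay zero. */
        if (a == b)
            result |= (uint64_t)0xFFu << (8 * i);
    }
    return result;
}
```

All-ones lanes make the result directly usable as a mask for the AND, ANDNOT, and OR intrinsics in this header.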

	/// Compares the 16-bit integer elements of two 64-bit integer vectors of			/// Compares the 16-bit integer elements of two 64-bit integer vectors of
	/// [4 x i16] to determine if the element of the first vector is equal to the			/// [4 x i16] to determine if the element of the first vector is equal to the
	/// corresponding element of the second vector.			/// corresponding element of the second vector.
	///			///
	/// The comparison yields 0 for false, 0xFFFF for true.			/// The comparison yields 0 for false, 0xFFFF for true.
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> PCMPEQW </c> instruction.			/// This intrinsic corresponds to the <c> PCMPEQW </c> instruction.
	///			///
	/// \param __m1			/// \param __m1
	/// A 64-bit integer vector of [4 x i16].			/// A 64-bit integer vector of [4 x i16].
	/// \param __m2			/// \param __m2
	/// A 64-bit integer vector of [4 x i16].			/// A 64-bit integer vector of [4 x i16].
	/// \returns A 64-bit integer vector of [4 x i16] containing the comparison			/// \returns A 64-bit integer vector of [4 x i16] containing the comparison
	/// results.			/// results.
	static __inline__ __m64 __DEFAULT_FN_ATTRS			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_cmpeq_pi16(__m64 __m1, __m64 __m2)			_mm_cmpeq_pi16(__m64 __m1, __m64 __m2)
	{			{
	return (__m64)__builtin_ia32_pcmpeqw((__v4hi)__m1, (__v4hi)__m2);			return (__m64)(((__v4hi)__m1) == ((__v4hi)__m2));
	}			}

	/// Compares the 32-bit integer elements of two 64-bit integer vectors of			/// Compares the 32-bit integer elements of two 64-bit integer vectors of
	/// [2 x i32] to determine if the element of the first vector is equal to the			/// [2 x i32] to determine if the element of the first vector is equal to the
	/// corresponding element of the second vector.			/// corresponding element of the second vector.
	///			///
	/// The comparison yields 0 for false, 0xFFFFFFFF for true.			/// The comparison yields 0 for false, 0xFFFFFFFF for true.
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> PCMPEQD </c> instruction.			/// This intrinsic corresponds to the <c> PCMPEQD </c> instruction.
	///			///
	/// \param __m1			/// \param __m1
	/// A 64-bit integer vector of [2 x i32].			/// A 64-bit integer vector of [2 x i32].
	/// \param __m2			/// \param __m2
	/// A 64-bit integer vector of [2 x i32].			/// A 64-bit integer vector of [2 x i32].
	/// \returns A 64-bit integer vector of [2 x i32] containing the comparison			/// \returns A 64-bit integer vector of [2 x i32] containing the comparison
	/// results.			/// results.
	static __inline__ __m64 __DEFAULT_FN_ATTRS			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_cmpeq_pi32(__m64 __m1, __m64 __m2)			_mm_cmpeq_pi32(__m64 __m1, __m64 __m2)
	{			{
	return (__m64)__builtin_ia32_pcmpeqd((__v2si)__m1, (__v2si)__m2);			return (__m64)(((__v2si)__m1) == ((__v2si)__m2));
	}			}

	/// Compares the 8-bit integer elements of two 64-bit integer vectors of			/// Compares the 8-bit integer elements of two 64-bit integer vectors of
	/// [8 x i8] to determine if the element of the first vector is greater than			/// [8 x i8] to determine if the element of the first vector is greater than
	/// the corresponding element of the second vector.			/// the corresponding element of the second vector.
	///			///
	/// The comparison yields 0 for false, 0xFF for true.			/// The comparison yields 0 for false, 0xFF for true.
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> PCMPGTB </c> instruction.			/// This intrinsic corresponds to the <c> PCMPGTB </c> instruction.
	///			///
	/// \param __m1			/// \param __m1
	/// A 64-bit integer vector of [8 x i8].			/// A 64-bit integer vector of [8 x i8].
	/// \param __m2			/// \param __m2
	/// A 64-bit integer vector of [8 x i8].			/// A 64-bit integer vector of [8 x i8].
	/// \returns A 64-bit integer vector of [8 x i8] containing the comparison			/// \returns A 64-bit integer vector of [8 x i8] containing the comparison
	/// results.			/// results.
	static __inline__ __m64 __DEFAULT_FN_ATTRS			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_cmpgt_pi8(__m64 __m1, __m64 __m2)			_mm_cmpgt_pi8(__m64 __m1, __m64 __m2)
	{			{
	return (__m64)__builtin_ia32_pcmpgtb((__v8qi)__m1, (__v8qi)__m2);			/* This function always performs a signed comparison, but __v8qi is a char
				which may be signed or unsigned, so use __v8qs. */
				return (__m64)((__v8qs)__m1 > (__v8qs)__m2);
				craig.topper: Need to use __v8qs here to force "signed char" elements. __v8qi uses "char" which has platform dependent signedness or can be changed with a command line.
				jyknight (author): Done.
	}			}
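To see why the element signedness matters for `_mm_cmpgt_pi8`, the following scalar sketch compares the same byte lanes under both interpretations. `model_cmpgt_pi8` and its `treat_as_signed` flag are hypothetical; the signed path models what PCMPGTB does, while the unsigned path models what a comparison on plain `char` elements could mean on a target where `char` is unsigned.

```c
#include <stdint.h>

/* Hypothetical scalar model of a per-byte greater-than comparison
   over a 64-bit value of [8 x i8]. */
static uint64_t model_cmpgt_pi8(uint64_t m1, uint64_t m2, int treat_as_signed)
{
    uint64_t result = 0;
    int i;

    for (i = 0; i < 8; ++i) {
        int a = (int)((m1 >> (8 * i)) & 0xFFu);
        int b = (int)((m2 >> (8 * i)) & 0xFFu);

        if (treat_as_signed) {
            /* Reinterpret 0x80..0xFF as -128..-1 without relying on
               implementation-defined narrowing conversions. */
            if (a >= 0x80) a -= 0x100;
            if (b >= 0x80) b -= 0x100;
        }
        if (a > b)
            result |= (uint64_t)0xFFu << (8 * i);
    }
    return result;
}
```

With `m1 = 0x8001` and `m2 = 0`, the lane holding `0x80` compares as -128 in the signed model and as 128 in the unsigned one, so the two interpretations produce different masks.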

	/// Compares the 16-bit integer elements of two 64-bit integer vectors of			/// Compares the 16-bit integer elements of two 64-bit integer vectors of
	/// [4 x i16] to determine if the element of the first vector is greater than			/// [4 x i16] to determine if the element of the first vector is greater than
	/// the corresponding element of the second vector.			/// the corresponding element of the second vector.
	///			///
	/// The comparison yields 0 for false, 0xFFFF for true.			/// The comparison yields 0 for false, 0xFFFF for true.
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> PCMPGTW </c> instruction.			/// This intrinsic corresponds to the <c> PCMPGTW </c> instruction.
	///			///
	/// \param __m1			/// \param __m1
	/// A 64-bit integer vector of [4 x i16].			/// A 64-bit integer vector of [4 x i16].
	/// \param __m2			/// \param __m2
	/// A 64-bit integer vector of [4 x i16].			/// A 64-bit integer vector of [4 x i16].
	/// \returns A 64-bit integer vector of [4 x i16] containing the comparison			/// \returns A 64-bit integer vector of [4 x i16] containing the comparison
	/// results.			/// results.
	static __inline__ __m64 __DEFAULT_FN_ATTRS			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_cmpgt_pi16(__m64 __m1, __m64 __m2)			_mm_cmpgt_pi16(__m64 __m1, __m64 __m2)
	{			{
	return (__m64)__builtin_ia32_pcmpgtw((__v4hi)__m1, (__v4hi)__m2);			return (__m64)((__v4hi)__m1 > (__v4hi)__m2);
				craig.topper: Same here
				jyknight (author): This is a short, which is always signed, so it should be ok as written.
				craig.topper: Yeah. I don't know why I wrote that now.
	}			}

	/// Compares the 32-bit integer elements of two 64-bit integer vectors of			/// Compares the 32-bit integer elements of two 64-bit integer vectors of
	/// [2 x i32] to determine if the element of the first vector is greater than			/// [2 x i32] to determine if the element of the first vector is greater than
	/// the corresponding element of the second vector.			/// the corresponding element of the second vector.
	///			///
	/// The comparison yields 0 for false, 0xFFFFFFFF for true.			/// The comparison yields 0 for false, 0xFFFFFFFF for true.
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> PCMPGTD </c> instruction.			/// This intrinsic corresponds to the <c> PCMPGTD </c> instruction.
	///			///
	/// \param __m1			/// \param __m1
	/// A 64-bit integer vector of [2 x i32].			/// A 64-bit integer vector of [2 x i32].
	/// \param __m2			/// \param __m2
	/// A 64-bit integer vector of [2 x i32].			/// A 64-bit integer vector of [2 x i32].
	/// \returns A 64-bit integer vector of [2 x i32] containing the comparison			/// \returns A 64-bit integer vector of [2 x i32] containing the comparison
	/// results.			/// results.
	static __inline__ __m64 __DEFAULT_FN_ATTRS			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_cmpgt_pi32(__m64 __m1, __m64 __m2)			_mm_cmpgt_pi32(__m64 __m1, __m64 __m2)
	{			{
	return (__m64)__builtin_ia32_pcmpgtd((__v2si)__m1, (__v2si)__m2);			return (__m64)((__v2si)__m1 > (__v2si)__m2);
	}			}

	/// Constructs a 64-bit integer vector initialized to zero.			/// Constructs a 64-bit integer vector initialized to zero.
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> PXOR </c> instruction.			/// This intrinsic corresponds to the <c> PXOR </c> instruction.
	///			///
	/// \returns An initialized 64-bit integer vector with all elements set to zero.			/// \returns An initialized 64-bit integer vector with all elements set to zero.
	static __inline__ __m64 __DEFAULT_FN_ATTRS			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_setzero_si64(void)			_mm_setzero_si64(void)
	{			{
	return __extension__ (__m64){ 0LL };			return __extension__ (__m64){ 0LL };
	}			}

	/// Constructs a 64-bit integer vector initialized with the specified			/// Constructs a 64-bit integer vector initialized with the specified
	/// 32-bit integer values.			/// 32-bit integer values.
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic is a utility function and does not correspond to a specific			/// This intrinsic is a utility function and does not correspond to a specific
	/// instruction.			/// instruction.
	///			///
	/// \param __i1			/// \param __i1
	/// A 32-bit integer value used to initialize the upper 32 bits of the			/// A 32-bit integer value used to initialize the upper 32 bits of the
	/// result.			/// result.
	/// \param __i0			/// \param __i0
	/// A 32-bit integer value used to initialize the lower 32 bits of the			/// A 32-bit integer value used to initialize the lower 32 bits of the
	/// result.			/// result.
	/// \returns An initialized 64-bit integer vector.			/// \returns An initialized 64-bit integer vector.
	static __inline__ __m64 __DEFAULT_FN_ATTRS			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_set_pi32(int __i1, int __i0)			_mm_set_pi32(int __i1, int __i0)
	{			{
	return (__m64)__builtin_ia32_vec_init_v2si(__i0, __i1);			return __extension__ (__m64)(__v2si){__i0, __i1};
	}			}
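The argument order of `_mm_set_pi32` runs from high to low, while the vector literal in the new implementation lists elements from low to high, which is why it reads `{__i0, __i1}`. A scalar sketch of the resulting bit layout, using a hypothetical helper named `model_set_pi32`:

```c
#include <stdint.h>

/* Hypothetical scalar model of the 64-bit layout built by _mm_set_pi32:
   i1 lands in bits [63:32], i0 in bits [31:0]. */
static uint64_t model_set_pi32(int32_t i1, int32_t i0)
{
    /* Convert through uint32_t first so negative values are not
       sign-extended into the other half. */
    return ((uint64_t)(uint32_t)i1 << 32) | (uint64_t)(uint32_t)i0;
}
```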

	/// Constructs a 64-bit integer vector initialized with the specified			/// Constructs a 64-bit integer vector initialized with the specified
	/// 16-bit integer values.			/// 16-bit integer values.
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic is a utility function and does not correspond to a specific			/// This intrinsic is a utility function and does not correspond to a specific
	/// instruction.			/// instruction.
	///			///
	/// \param __s3			/// \param __s3
	/// A 16-bit integer value used to initialize bits [63:48] of the result.			/// A 16-bit integer value used to initialize bits [63:48] of the result.
	/// \param __s2			/// \param __s2
	/// A 16-bit integer value used to initialize bits [47:32] of the result.			/// A 16-bit integer value used to initialize bits [47:32] of the result.
	/// \param __s1			/// \param __s1
	/// A 16-bit integer value used to initialize bits [31:16] of the result.			/// A 16-bit integer value used to initialize bits [31:16] of the result.
	/// \param __s0			/// \param __s0
	/// A 16-bit integer value used to initialize bits [15:0] of the result.			/// A 16-bit integer value used to initialize bits [15:0] of the result.
	/// \returns An initialized 64-bit integer vector.			/// \returns An initialized 64-bit integer vector.
	static __inline__ __m64 __DEFAULT_FN_ATTRS			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_set_pi16(short __s3, short __s2, short __s1, short __s0)			_mm_set_pi16(short __s3, short __s2, short __s1, short __s0)
	{			{
	return (__m64)__builtin_ia32_vec_init_v4hi(__s0, __s1, __s2, __s3);			return __extension__ (__m64)(__v4hi){__s0, __s1, __s2, __s3};
	}			}

	/// Constructs a 64-bit integer vector initialized with the specified			/// Constructs a 64-bit integer vector initialized with the specified
	/// 8-bit integer values.			/// 8-bit integer values.
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic is a utility function and does not correspond to a specific			/// This intrinsic is a utility function and does not correspond to a specific
	[... 11 unchanged lines collapsed ...]
	/// An 8-bit integer value used to initialize bits [31:24] of the result.			/// An 8-bit integer value used to initialize bits [31:24] of the result.
	/// \param __b2			/// \param __b2
	/// An 8-bit integer value used to initialize bits [23:16] of the result.			/// An 8-bit integer value used to initialize bits [23:16] of the result.
	/// \param __b1			/// \param __b1
	/// An 8-bit integer value used to initialize bits [15:8] of the result.			/// An 8-bit integer value used to initialize bits [15:8] of the result.
	/// \param __b0			/// \param __b0
	/// An 8-bit integer value used to initialize bits [7:0] of the result.			/// An 8-bit integer value used to initialize bits [7:0] of the result.
	/// \returns An initialized 64-bit integer vector.			/// \returns An initialized 64-bit integer vector.
	static __inline__ __m64 __DEFAULT_FN_ATTRS			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_set_pi8(char __b7, char __b6, char __b5, char __b4, char __b3, char __b2,			_mm_set_pi8(char __b7, char __b6, char __b5, char __b4, char __b3, char __b2,
	char __b1, char __b0)			char __b1, char __b0)
	{			{
	return (__m64)__builtin_ia32_vec_init_v8qi(__b0, __b1, __b2, __b3,			return __extension__ (__m64)(__v8qi){__b0, __b1, __b2, __b3,
	__b4, __b5, __b6, __b7);			__b4, __b5, __b6, __b7};
	}			}

	/// Constructs a 64-bit integer vector of [2 x i32], with each of the			/// Constructs a 64-bit integer vector of [2 x i32], with each of the
	/// 32-bit integer vector elements set to the specified 32-bit integer			/// 32-bit integer vector elements set to the specified 32-bit integer
	/// value.			/// value.
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic is a utility function and does not correspond to a specific			/// This intrinsic is a utility function and does not correspond to a specific
	/// instruction.			/// instruction.
	///			///
	/// \param __i			/// \param __i
	/// A 32-bit integer value used to initialize each vector element of the			/// A 32-bit integer value used to initialize each vector element of the
	/// result.			/// result.
	/// \returns An initialized 64-bit integer vector of [2 x i32].			/// \returns An initialized 64-bit integer vector of [2 x i32].
	static __inline__ __m64 __DEFAULT_FN_ATTRS			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_set1_pi32(int __i)			_mm_set1_pi32(int __i)
	{			{
	return _mm_set_pi32(__i, __i);			return _mm_set_pi32(__i, __i);
				craig.topper (Done): Is this needed?
				jyknight (Done): No, reverted this change and the others like it.
	}			}

	/// Constructs a 64-bit integer vector of [4 x i16], with each of the			/// Constructs a 64-bit integer vector of [4 x i16], with each of the
	/// 16-bit integer vector elements set to the specified 16-bit integer			/// 16-bit integer vector elements set to the specified 16-bit integer
	/// value.			/// value.
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic is a utility function and does not correspond to a specific			/// This intrinsic is a utility function and does not correspond to a specific
	/// instruction.			/// instruction.
	///			///
	/// \param __w			/// \param __w
	/// A 16-bit integer value used to initialize each vector element of the			/// A 16-bit integer value used to initialize each vector element of the
	/// result.			/// result.
	/// \returns An initialized 64-bit integer vector of [4 x i16].			/// \returns An initialized 64-bit integer vector of [4 x i16].
	static __inline__ __m64 __DEFAULT_FN_ATTRS			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_set1_pi16(short __w)			_mm_set1_pi16(short __w)
	{			{
	return _mm_set_pi16(__w, __w, __w, __w);			return _mm_set_pi16(__w, __w, __w, __w);
	}			}

	/// Constructs a 64-bit integer vector of [8 x i8], with each of the			/// Constructs a 64-bit integer vector of [8 x i8], with each of the
	/// 8-bit integer vector elements set to the specified 8-bit integer value.			/// 8-bit integer vector elements set to the specified 8-bit integer value.
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic is a utility function and does not correspond to a specific			/// This intrinsic is a utility function and does not correspond to a specific
	/// instruction.			/// instruction.
	///			///
	/// \param __b			/// \param __b
	/// An 8-bit integer value used to initialize each vector element of the			/// An 8-bit integer value used to initialize each vector element of the
	/// result.			/// result.
	/// \returns An initialized 64-bit integer vector of [8 x i8].			/// \returns An initialized 64-bit integer vector of [8 x i8].
	static __inline__ __m64 __DEFAULT_FN_ATTRS			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_set1_pi8(char __b)			_mm_set1_pi8(char __b)
	{			{
	return _mm_set_pi8(__b, __b, __b, __b, __b, __b, __b, __b);			return _mm_set_pi8(__b, __b, __b, __b, __b, __b, __b, __b);
	}			}

	/// Constructs a 64-bit integer vector, initialized in reverse order with			/// Constructs a 64-bit integer vector, initialized in reverse order with
	/// the specified 32-bit integer values.			/// the specified 32-bit integer values.
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic is a utility function and does not correspond to a specific			/// This intrinsic is a utility function and does not correspond to a specific
	/// instruction.			/// instruction.
	///			///
	/// \param __i0			/// \param __i0
	/// A 32-bit integer value used to initialize the lower 32 bits of the			/// A 32-bit integer value used to initialize the lower 32 bits of the
	/// result.			/// result.
	/// \param __i1			/// \param __i1
	/// A 32-bit integer value used to initialize the upper 32 bits of the			/// A 32-bit integer value used to initialize the upper 32 bits of the
	/// result.			/// result.
	/// \returns An initialized 64-bit integer vector.			/// \returns An initialized 64-bit integer vector.
	static __inline__ __m64 __DEFAULT_FN_ATTRS			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_setr_pi32(int __i0, int __i1)			_mm_setr_pi32(int __i0, int __i1)
	{			{
	return _mm_set_pi32(__i1, __i0);			return _mm_set_pi32(__i1, __i0);
				craig.topper (Done): I don't think this change is needed. And I think the operands are in the wrong order.
				jyknight (Done): Change was unnecessary, so reverted. (But operands are supposed to be backwards here.)
	}			}
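
The exchange above turns on the operand order of the "reverse" setter. A minimal scalar sketch, assuming the lane layout given in the doc comments (`model_set_pi32` and `model_setr_pi32` are hypothetical names, not part of the patch), shows why passing the operands swapped is the intended behaviour:

```c
#include <stdint.h>

/* Hypothetical model of _mm_set_pi32: first argument is the upper 32 bits. */
static uint64_t model_set_pi32(int32_t i1, int32_t i0)
{
    return ((uint64_t)(uint32_t)i1 << 32) | (uint32_t)i0;
}

/* Hypothetical model of _mm_setr_pi32: first argument is the lower 32 bits,
   so it forwards its operands to the non-reversed setter in swapped order. */
static uint64_t model_setr_pi32(int32_t i0, int32_t i1)
{
    return model_set_pi32(i1, i0);
}
```

With this model, `model_setr_pi32(1, 2)` and `model_set_pi32(2, 1)` describe the same vector.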

	/// Constructs a 64-bit integer vector, initialized in reverse order with			/// Constructs a 64-bit integer vector, initialized in reverse order with
	/// the specified 16-bit integer values.			/// the specified 16-bit integer values.
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic is a utility function and does not correspond to a specific			/// This intrinsic is a utility function and does not correspond to a specific
	/// instruction.			/// instruction.
	///			///
	/// \param __w0			/// \param __w0
	/// A 16-bit integer value used to initialize bits [15:0] of the result.			/// A 16-bit integer value used to initialize bits [15:0] of the result.
	/// \param __w1			/// \param __w1
	/// A 16-bit integer value used to initialize bits [31:16] of the result.			/// A 16-bit integer value used to initialize bits [31:16] of the result.
	/// \param __w2			/// \param __w2
	/// A 16-bit integer value used to initialize bits [47:32] of the result.			/// A 16-bit integer value used to initialize bits [47:32] of the result.
	/// \param __w3			/// \param __w3
	/// A 16-bit integer value used to initialize bits [63:48] of the result.			/// A 16-bit integer value used to initialize bits [63:48] of the result.
	/// \returns An initialized 64-bit integer vector.			/// \returns An initialized 64-bit integer vector.
	static __inline__ __m64 __DEFAULT_FN_ATTRS			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_setr_pi16(short __w0, short __w1, short __w2, short __w3)			_mm_setr_pi16(short __w0, short __w1, short __w2, short __w3)
	{			{
	return _mm_set_pi16(__w3, __w2, __w1, __w0);			return _mm_set_pi16(__w3, __w2, __w1, __w0);
	}			}

	/// Constructs a 64-bit integer vector, initialized in reverse order with			/// Constructs a 64-bit integer vector, initialized in reverse order with
	/// the specified 8-bit integer values.			/// the specified 8-bit integer values.
	///			///
	[14 lines not shown]
	/// An 8-bit integer value used to initialize bits [39:32] of the result.			/// An 8-bit integer value used to initialize bits [39:32] of the result.
	/// \param __b5			/// \param __b5
	/// An 8-bit integer value used to initialize bits [47:40] of the result.			/// An 8-bit integer value used to initialize bits [47:40] of the result.
	/// \param __b6			/// \param __b6
	/// An 8-bit integer value used to initialize bits [55:48] of the result.			/// An 8-bit integer value used to initialize bits [55:48] of the result.
	/// \param __b7			/// \param __b7
	/// An 8-bit integer value used to initialize bits [63:56] of the result.			/// An 8-bit integer value used to initialize bits [63:56] of the result.
	/// \returns An initialized 64-bit integer vector.			/// \returns An initialized 64-bit integer vector.
	static __inline__ __m64 __DEFAULT_FN_ATTRS			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_setr_pi8(char __b0, char __b1, char __b2, char __b3, char __b4, char __b5,			_mm_setr_pi8(char __b0, char __b1, char __b2, char __b3, char __b4, char __b5,
	char __b6, char __b7)			char __b6, char __b7)
	{			{
	return _mm_set_pi8(__b7, __b6, __b5, __b4, __b3, __b2, __b1, __b0);			return _mm_set_pi8(__b7, __b6, __b5, __b4, __b3, __b2, __b1, __b0);
	}			}

	#undef __DEFAULT_FN_ATTRS			#undef __extract2_32
				#undef __anyext128
				#undef __trunc64
				#undef __DEFAULT_FN_ATTRS_SSE2

	/* Aliases for compatibility. */			/* Aliases for compatibility. */
	#define _m_empty _mm_empty			#define _m_empty _mm_empty
	#define _m_from_int _mm_cvtsi32_si64			#define _m_from_int _mm_cvtsi32_si64
	#define _m_from_int64 _mm_cvtsi64_m64			#define _m_from_int64 _mm_cvtsi64_m64
	#define _m_to_int _mm_cvtsi64_si32			#define _m_to_int _mm_cvtsi64_si32
	#define _m_to_int64 _mm_cvtm64_si64			#define _m_to_int64 _mm_cvtm64_si64
	#define _m_packsswb _mm_packs_pi16			#define _m_packsswb _mm_packs_pi16
	[54 lines not shown]

clang/lib/Headers/tmmintrin.h

	/*===---- tmmintrin.h - SSSE3 intrinsics -----------------------------------===			/*===---- tmmintrin.h - SSSE3 intrinsics -----------------------------------===
	*			*
	* Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.			* Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
	* See https://llvm.org/LICENSE.txt for license information.			* See https://llvm.org/LICENSE.txt for license information.
	* SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception			* SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
	*			*
	*===-----------------------------------------------------------------------===			*===-----------------------------------------------------------------------===
	*/			*/

	#ifndef __TMMINTRIN_H			#ifndef __TMMINTRIN_H
	#define __TMMINTRIN_H			#define __TMMINTRIN_H

	#include <pmmintrin.h>			#include <pmmintrin.h>

	/* Define the default attributes for the functions in this file. */			/* Define the default attributes for the functions in this file. */
	#define __DEFAULT_FN_ATTRS __attribute__((__always_inline__, __nodebug__, __target__("ssse3"), __min_vector_width__(64)))			#define __DEFAULT_FN_ATTRS __attribute__((__always_inline__, __nodebug__, __target__("ssse3"), __min_vector_width__(64)))
	#define __DEFAULT_FN_ATTRS_MMX __attribute__((__always_inline__, __nodebug__, __target__("mmx,ssse3"), __min_vector_width__(64)))
				#define __trunc64(x) (__m64)__builtin_shufflevector((__v2di)(x), __extension__ (__v2di){}, 0)
				#define __anyext128(x) (__m128i)__builtin_shufflevector((__v2si)(x), __extension__ (__v2si){}, 0, 1, -1, -1)
				#define __extract2_32(a) (__m64)__builtin_shufflevector((__v4si)(a), __extension__ (__v4si){}, 0, 2)
				craig.topper (Not Done): I'm worried that using v1di with the shuffles will lead to scalarization in the type legalizer. Should we use v2si instead?

				jyknight (Not Done): Converting `__trunc64` to v4si (and thus a v2si return value) seems to make codegen _worse_ in some cases, and I don't see any case where it gets better. For example:

				    #define __trunc64_1(x) (__m64)__builtin_shufflevector((__v2di)(x), __extension__ (__v2di){}, 0)
				    #define __trunc64_2(x) (__m64)__builtin_shufflevector((__v4si)(x), __extension__ (__v4si){}, 0, 1)
				    __m64 trunc1(__m128 a, int i) { return __trunc64_1(__builtin_ia32_psllqi128(a, i)); }
				    __m64 trunc2(__m128 a, int i) { return __trunc64_2(__builtin_ia32_psllqi128(a, i)); }

				In trunc2, you get two extraneous moves at the end:

				    movd %edi, %xmm1
				    psllq %xmm1, %xmm0
				    movq %xmm0, %rax // extra
				    movq %rax, %xmm0 // extra

				I guess that's related to the calling-convention lowering, which turns m64 into "double", confusing the various IR simplifications?

				Similarly, there are also extraneous moves to/from a GPR for argument passing sometimes, but I don't see an easy way around that. Both variants do that here, instead of just `movq %xmm0, %xmm0`:

				    #define __anyext128_1(x) (__m128i)__builtin_shufflevector((__v1di)(x), __extension__ (__v1di){}, 0, -1)
				    #define __anyext128_2(x) (__m128i)__builtin_shufflevector((__v2si)(x), __extension__ (__v2si){}, 0, 1, -1, -1)
				    #define __zext128_1(x) (__m128i)__builtin_shufflevector((__v1di)(x), __extension__ (__v1di){}, 0, 1)
				    #define __zext128_2(x) (__m128i)__builtin_shufflevector((__v2si)(x), __extension__ (__v2si){}, 0, 1, 2, 3)
				    __m128 ext1(__m64 a) { return __builtin_convertvector((__v4si)__zext128_1(a), __v4sf); }
				    __m128 ext2(__m64 a) { return __builtin_convertvector((__v4si)__zext128_2(a), __v4sf); }

				Both produce:

				    movq %xmm0, %rax
				    movq %rax, %xmm0
				    cvtdq2ps %xmm0, %xmm0
				    retq

				However, switching to variant 2 of `anyext128` and `zext128` does seem to improve things in other cases, avoiding _some_ of those sorts of extraneous moves to a scalar register and back again. So I've made that change.

	/// Computes the absolute value of each of the packed 8-bit signed			/// Computes the absolute value of each of the packed 8-bit signed
	/// integers in the source operand and stores the 8-bit unsigned integer			/// integers in the source operand and stores the 8-bit unsigned integer
	/// results in the destination.			/// results in the destination.
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the \c PABSB instruction.			/// This intrinsic corresponds to the \c PABSB instruction.
	///			///
	/// \param __a			/// \param __a
	/// A 64-bit vector of [8 x i8].			/// A 64-bit vector of [8 x i8].
	/// \returns A 64-bit integer vector containing the absolute values of the			/// \returns A 64-bit integer vector containing the absolute values of the
	/// elements in the operand.			/// elements in the operand.
	static __inline__ __m64 __DEFAULT_FN_ATTRS_MMX			static __inline__ __m64 __DEFAULT_FN_ATTRS
	_mm_abs_pi8(__m64 __a)			_mm_abs_pi8(__m64 __a)
	{			{
	return (__m64)__builtin_ia32_pabsb((__v8qi)__a);			return __trunc64(__builtin_ia32_pabsb128((__v16qi)__anyext128(__a)));
	}			}
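
The new body follows a widen, operate, truncate pattern: `__anyext128` places the 64-bit operand in the low half of a 128-bit vector with unspecified upper lanes, the 128-bit builtin runs on all lanes, and `__trunc64` keeps only the low 64 bits. The sketch below models that in portable scalar C. All `model_*` names are hypothetical, and the `0xAA` filler merely stands in for the unspecified upper lanes to show that a lane-wise operation never lets them reach the result.

```c
#include <stdint.h>
#include <string.h>

typedef struct { uint8_t b[16]; } model_v16; /* stand-in for a 128-bit vector */

/* Low 8 bytes come from x; upper 8 bytes are "don't care" (filled with 0xAA). */
static model_v16 model_anyext128(uint64_t x)
{
    model_v16 v;
    memset(v.b, 0xAA, sizeof v.b);
    for (int i = 0; i < 8; ++i)
        v.b[i] = (uint8_t)(x >> (8 * i));
    return v;
}

/* Lane-wise absolute value of 16 signed bytes (0x80 stays 0x80). */
static model_v16 model_pabsb128(model_v16 v)
{
    for (int i = 0; i < 16; ++i) {
        int s = (int8_t)v.b[i];
        v.b[i] = (uint8_t)(s < 0 ? -s : s);
    }
    return v;
}

/* Keep only the low 64 bits. */
static uint64_t model_trunc64(model_v16 v)
{
    uint64_t r = 0;
    for (int i = 0; i < 8; ++i)
        r |= (uint64_t)v.b[i] << (8 * i);
    return r;
}

static uint64_t model_abs_pi8(uint64_t a)
{
    return model_trunc64(model_pabsb128(model_anyext128(a)));
}
```

Because each output lane depends only on the matching input lane, the filler bytes are discarded by the truncation and cannot affect the 64-bit result.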

	/// Computes the absolute value of each of the packed 8-bit signed			/// Computes the absolute value of each of the packed 8-bit signed
	/// integers in the source operand and stores the 8-bit unsigned integer			/// integers in the source operand and stores the 8-bit unsigned integer
	/// results in the destination.			/// results in the destination.
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	[16 lines not shown]
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the \c PABSW instruction.			/// This intrinsic corresponds to the \c PABSW instruction.
	///			///
	/// \param __a			/// \param __a
	/// A 64-bit vector of [4 x i16].			/// A 64-bit vector of [4 x i16].
	/// \returns A 64-bit integer vector containing the absolute values of the			/// \returns A 64-bit integer vector containing the absolute values of the
	/// elements in the operand.			/// elements in the operand.
	static __inline__ __m64 __DEFAULT_FN_ATTRS_MMX			static __inline__ __m64 __DEFAULT_FN_ATTRS
	_mm_abs_pi16(__m64 __a)			_mm_abs_pi16(__m64 __a)
	{			{
	return (__m64)__builtin_ia32_pabsw((__v4hi)__a);			return __trunc64(__builtin_ia32_pabsw128((__v8hi)__anyext128(__a)));
	}			}

	/// Computes the absolute value of each of the packed 16-bit signed			/// Computes the absolute value of each of the packed 16-bit signed
	/// integers in the source operand and stores the 16-bit unsigned integer			/// integers in the source operand and stores the 16-bit unsigned integer
	/// results in the destination.			/// results in the destination.
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	[16 lines not shown]
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the \c PABSD instruction.			/// This intrinsic corresponds to the \c PABSD instruction.
	///			///
	/// \param __a			/// \param __a
	/// A 64-bit vector of [2 x i32].			/// A 64-bit vector of [2 x i32].
	/// \returns A 64-bit integer vector containing the absolute values of the			/// \returns A 64-bit integer vector containing the absolute values of the
	/// elements in the operand.			/// elements in the operand.
	static __inline__ __m64 __DEFAULT_FN_ATTRS_MMX			static __inline__ __m64 __DEFAULT_FN_ATTRS
	_mm_abs_pi32(__m64 __a)			_mm_abs_pi32(__m64 __a)
	{			{
	return (__m64)__builtin_ia32_pabsd((__v2si)__a);			return __trunc64(__builtin_ia32_pabsd128((__v4si)__anyext128(__a)));
	}			}

	/// Computes the absolute value of each of the packed 32-bit signed			/// Computes the absolute value of each of the packed 32-bit signed
	/// integers in the source operand and stores the 32-bit unsigned integer			/// integers in the source operand and stores the 32-bit unsigned integer
	/// results in the destination.			/// results in the destination.
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	[48 lines not shown]
	/// A 64-bit vector of [8 x i8] containing one of the source operands.			/// A 64-bit vector of [8 x i8] containing one of the source operands.
	/// \param b			/// \param b
	/// A 64-bit vector of [8 x i8] containing one of the source operands.			/// A 64-bit vector of [8 x i8] containing one of the source operands.
	/// \param n			/// \param n
	/// An immediate operand specifying how many bytes to right-shift the result.			/// An immediate operand specifying how many bytes to right-shift the result.
	/// \returns A 64-bit integer vector containing the concatenated right-shifted			/// \returns A 64-bit integer vector containing the concatenated right-shifted
	/// value.			/// value.
	#define _mm_alignr_pi8(a, b, n) \			#define _mm_alignr_pi8(a, b, n) \
	(__m64)__builtin_ia32_palignr((__v8qi)(__m64)(a), (__v8qi)(__m64)(b), (n))			(__m64)__builtin_shufflevector( \
				__builtin_ia32_psrldqi128_byteshift( \
				__builtin_shufflevector((__v1di)(a), (__v1di)(b), 1, 0), \
				(n)), __extension__ (__v2di){}, 0)
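
The replacement macro builds a 128-bit value with `b` in the low half and `a` in the high half (shuffle indices `1, 0`), shifts it right by `n` bytes, and keeps the low 64 bits. A scalar sketch of that data flow, using a hypothetical helper `model_alignr_pi8` and assuming `n` is a non-negative byte count, could look like this:

```c
#include <stdint.h>

/* Hypothetical scalar model: treat a:b as one 128-bit value (a is the high
   half), shift right by n bytes with zero fill, return the low 64 bits. */
static uint64_t model_alignr_pi8(uint64_t a, uint64_t b, unsigned n)
{
    if (n >= 16)
        return 0;                     /* everything shifted out */
    if (n >= 8)
        return a >> (8 * (n - 8));    /* only bytes of a remain in the low half */
    if (n == 0)
        return b;                     /* avoid an undefined 64-bit shift by 64 */
    return (b >> (8 * n)) | (a << (64 - 8 * n));
}
```

The special cases for `n == 0` and `n >= 8` exist only because C shifts by the full width of the type are undefined; the 128-bit byte shift in the macro has no such restriction.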

	/// Horizontally adds the adjacent pairs of values contained in 2 packed			/// Horizontally adds the adjacent pairs of values contained in 2 packed
	/// 128-bit vectors of [8 x i16].			/// 128-bit vectors of [8 x i16].
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the \c VPHADDW instruction.			/// This intrinsic corresponds to the \c VPHADDW instruction.
	///			///
	[48 lines not shown]
	/// horizontal sums of the values are stored in the lower bits of the			/// horizontal sums of the values are stored in the lower bits of the
	/// destination.			/// destination.
	/// \param __b			/// \param __b
	/// A 64-bit vector of [4 x i16] containing one of the source operands. The			/// A 64-bit vector of [4 x i16] containing one of the source operands. The
	/// horizontal sums of the values are stored in the upper bits of the			/// horizontal sums of the values are stored in the upper bits of the
	/// destination.			/// destination.
	/// \returns A 64-bit vector of [4 x i16] containing the horizontal sums of both			/// \returns A 64-bit vector of [4 x i16] containing the horizontal sums of both
	/// operands.			/// operands.
	static __inline__ __m64 __DEFAULT_FN_ATTRS_MMX			static __inline__ __m64 __DEFAULT_FN_ATTRS
	_mm_hadd_pi16(__m64 __a, __m64 __b)			_mm_hadd_pi16(__m64 __a, __m64 __b)
	{			{
	return (__m64)__builtin_ia32_phaddw((__v4hi)__a, (__v4hi)__b);			return __extract2_32(__builtin_ia32_phaddw128((__v8hi)__anyext128(__a),
				(__v8hi)__anyext128(__b)));
	}			}
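
The horizontal operations use `__extract2_32` rather than `__trunc64` because the 128-bit instruction puts the sums from the first operand in lanes 0..3 and those from the second operand in lanes 4..7. With each 64-bit input widened to 128 bits, the meaningful sums sit in 16-bit lanes 0, 1 (32-bit element 0) and lanes 4, 5 (32-bit element 2). The following scalar sketch illustrates this; the `model_*` names are hypothetical and `0xDEAD` stands in for the unspecified upper lanes.

```c
#include <stdint.h>

/* Model of the 128-bit horizontal add on [8 x i16] (wrapping sums). */
static void model_phaddw128(const uint16_t a[8], const uint16_t b[8],
                            uint16_t r[8])
{
    for (int i = 0; i < 4; ++i) {
        r[i]     = (uint16_t)(a[2 * i] + a[2 * i + 1]); /* sums from a */
        r[4 + i] = (uint16_t)(b[2 * i] + b[2 * i + 1]); /* sums from b */
    }
}

/* Hypothetical model of the patched _mm_hadd_pi16. */
static uint64_t model_hadd_pi16(uint64_t a, uint64_t b)
{
    uint16_t wa[8], wb[8], r[8];
    for (int i = 0; i < 4; ++i) {
        wa[i] = (uint16_t)(a >> (16 * i));
        wb[i] = (uint16_t)(b >> (16 * i));
        wa[4 + i] = wb[4 + i] = 0xDEAD; /* unspecified upper lanes */
    }
    model_phaddw128(wa, wb, r);

    /* 32-bit elements 0 and 2 correspond to 16-bit lanes {0,1} and {4,5}. */
    const uint16_t out[4] = {r[0], r[1], r[4], r[5]};
    uint64_t res = 0;
    for (int i = 0; i < 4; ++i)
        res |= (uint64_t)out[i] << (16 * i);
    return res;
}
```

Lanes 2, 3, 6 and 7 of the 128-bit result are derived from the filler and are dropped by the extraction.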

	/// Horizontally adds the adjacent pairs of values contained in 2 packed			/// Horizontally adds the adjacent pairs of values contained in 2 packed
	/// 64-bit vectors of [2 x i32].			/// 64-bit vectors of [2 x i32].
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the \c PHADDD instruction.			/// This intrinsic corresponds to the \c PHADDD instruction.
	///			///
	/// \param __a			/// \param __a
	/// A 64-bit vector of [2 x i32] containing one of the source operands. The			/// A 64-bit vector of [2 x i32] containing one of the source operands. The
	/// horizontal sums of the values are stored in the lower bits of the			/// horizontal sums of the values are stored in the lower bits of the
	/// destination.			/// destination.
	/// \param __b			/// \param __b
	/// A 64-bit vector of [2 x i32] containing one of the source operands. The			/// A 64-bit vector of [2 x i32] containing one of the source operands. The
	/// horizontal sums of the values are stored in the upper bits of the			/// horizontal sums of the values are stored in the upper bits of the
	/// destination.			/// destination.
	/// \returns A 64-bit vector of [2 x i32] containing the horizontal sums of both			/// \returns A 64-bit vector of [2 x i32] containing the horizontal sums of both
	/// operands.			/// operands.
	static __inline__ __m64 __DEFAULT_FN_ATTRS_MMX			static __inline__ __m64 __DEFAULT_FN_ATTRS
	_mm_hadd_pi32(__m64 __a, __m64 __b)			_mm_hadd_pi32(__m64 __a, __m64 __b)
	{			{
	return (__m64)__builtin_ia32_phaddd((__v2si)__a, (__v2si)__b);			return __extract2_32(__builtin_ia32_phaddd128((__v4si)__anyext128(__a),
				(__v4si)__anyext128(__b)));
	}			}

	/// Horizontally adds the adjacent pairs of values contained in 2 packed			/// Horizontally adds the adjacent pairs of values contained in 2 packed
	/// 128-bit vectors of [8 x i16]. Positive sums greater than 0x7FFF are			/// 128-bit vectors of [8 x i16]. Positive sums greater than 0x7FFF are
	/// saturated to 0x7FFF. Negative sums less than 0x8000 are saturated to			/// saturated to 0x7FFF. Negative sums less than 0x8000 are saturated to
	/// 0x8000.			/// 0x8000.
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	[30 lines not shown]
	/// horizontal sums of the values are stored in the lower bits of the			/// horizontal sums of the values are stored in the lower bits of the
	/// destination.			/// destination.
	/// \param __b			/// \param __b
	/// A 64-bit vector of [4 x i16] containing one of the source operands. The			/// A 64-bit vector of [4 x i16] containing one of the source operands. The
	/// horizontal sums of the values are stored in the upper bits of the			/// horizontal sums of the values are stored in the upper bits of the
	/// destination.			/// destination.
	/// \returns A 64-bit vector of [4 x i16] containing the horizontal saturated			/// \returns A 64-bit vector of [4 x i16] containing the horizontal saturated
	/// sums of both operands.			/// sums of both operands.
	static __inline__ __m64 __DEFAULT_FN_ATTRS_MMX			static __inline__ __m64 __DEFAULT_FN_ATTRS
	_mm_hadds_pi16(__m64 __a, __m64 __b)			_mm_hadds_pi16(__m64 __a, __m64 __b)
	{			{
	return (__m64)__builtin_ia32_phaddsw((__v4hi)__a, (__v4hi)__b);			return __extract2_32(__builtin_ia32_phaddsw128((__v8hi)__anyext128(__a),
				(__v8hi)__anyext128(__b)));
	}			}

	/// Horizontally subtracts the adjacent pairs of values contained in 2			/// Horizontally subtracts the adjacent pairs of values contained in 2
	/// packed 128-bit vectors of [8 x i16].			/// packed 128-bit vectors of [8 x i16].
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the \c VPHSUBW instruction.			/// This intrinsic corresponds to the \c VPHSUBW instruction.
	[49 lines not shown]
	/// horizontal differences between the values are stored in the lower bits of			/// horizontal differences between the values are stored in the lower bits of
	/// the destination.			/// the destination.
	/// \param __b			/// \param __b
	/// A 64-bit vector of [4 x i16] containing one of the source operands. The			/// A 64-bit vector of [4 x i16] containing one of the source operands. The
	/// horizontal differences between the values are stored in the upper bits of			/// horizontal differences between the values are stored in the upper bits of
	/// the destination.			/// the destination.
	/// \returns A 64-bit vector of [4 x i16] containing the horizontal differences			/// \returns A 64-bit vector of [4 x i16] containing the horizontal differences
	/// of both operands.			/// of both operands.
	static __inline__ __m64 __DEFAULT_FN_ATTRS_MMX			static __inline__ __m64 __DEFAULT_FN_ATTRS
	_mm_hsub_pi16(__m64 __a, __m64 __b)			_mm_hsub_pi16(__m64 __a, __m64 __b)
	{			{
	return (__m64)__builtin_ia32_phsubw((__v4hi)__a, (__v4hi)__b);			return __extract2_32(__builtin_ia32_phsubw128((__v8hi)__anyext128(__a),
				(__v8hi)__anyext128(__b)));
	}			}

	/// Horizontally subtracts the adjacent pairs of values contained in 2			/// Horizontally subtracts the adjacent pairs of values contained in 2
	/// packed 64-bit vectors of [2 x i32].			/// packed 64-bit vectors of [2 x i32].
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the \c PHSUBD instruction.			/// This intrinsic corresponds to the \c PHSUBD instruction.
	///			///
	/// \param __a			/// \param __a
	/// A 64-bit vector of [2 x i32] containing one of the source operands. The			/// A 64-bit vector of [2 x i32] containing one of the source operands. The
	/// horizontal differences between the values are stored in the lower bits of			/// horizontal differences between the values are stored in the lower bits of
	/// the destination.			/// the destination.
	/// \param __b			/// \param __b
	/// A 64-bit vector of [2 x i32] containing one of the source operands. The			/// A 64-bit vector of [2 x i32] containing one of the source operands. The
	/// horizontal differences between the values are stored in the upper bits of			/// horizontal differences between the values are stored in the upper bits of
	/// the destination.			/// the destination.
	/// \returns A 64-bit vector of [2 x i32] containing the horizontal differences			/// \returns A 64-bit vector of [2 x i32] containing the horizontal differences
	/// of both operands.			/// of both operands.
	static __inline__ __m64 __DEFAULT_FN_ATTRS_MMX			static __inline__ __m64 __DEFAULT_FN_ATTRS
	_mm_hsub_pi32(__m64 __a, __m64 __b)			_mm_hsub_pi32(__m64 __a, __m64 __b)
	{			{
	return (__m64)__builtin_ia32_phsubd((__v2si)__a, (__v2si)__b);			return __extract2_32(__builtin_ia32_phsubd128((__v4si)__anyext128(__a),
				(__v4si)__anyext128(__b)));
	}			}

	/// Horizontally subtracts the adjacent pairs of values contained in 2			/// Horizontally subtracts the adjacent pairs of values contained in 2
	/// packed 128-bit vectors of [8 x i16]. Positive differences greater than			/// packed 128-bit vectors of [8 x i16]. Positive differences greater than
	/// 0x7FFF are saturated to 0x7FFF. Negative differences less than 0x8000 are			/// 0x7FFF are saturated to 0x7FFF. Negative differences less than 0x8000 are
	/// saturated to 0x8000.			/// saturated to 0x8000.
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	[30 lines not shown]
	/// horizontal differences between the values are stored in the lower bits of			/// horizontal differences between the values are stored in the lower bits of
	/// the destination.			/// the destination.
	/// \param __b			/// \param __b
	/// A 64-bit vector of [4 x i16] containing one of the source operands. The			/// A 64-bit vector of [4 x i16] containing one of the source operands. The
	/// horizontal differences between the values are stored in the upper bits of			/// horizontal differences between the values are stored in the upper bits of
	/// the destination.			/// the destination.
	/// \returns A 64-bit vector of [4 x i16] containing the horizontal saturated			/// \returns A 64-bit vector of [4 x i16] containing the horizontal saturated
	/// differences of both operands.			/// differences of both operands.
	static __inline__ __m64 __DEFAULT_FN_ATTRS_MMX			static __inline__ __m64 __DEFAULT_FN_ATTRS
	_mm_hsubs_pi16(__m64 __a, __m64 __b)			_mm_hsubs_pi16(__m64 __a, __m64 __b)
	{			{
	return (__m64)__builtin_ia32_phsubsw((__v4hi)__a, (__v4hi)__b);			return __extract2_32(__builtin_ia32_phsubsw128((__v8hi)__anyext128(__a),
				(__v8hi)__anyext128(__b)));
	}			}

	/// Multiplies corresponding pairs of packed 8-bit unsigned integer			/// Multiplies corresponding pairs of packed 8-bit unsigned integer
	/// values contained in the first source operand and packed 8-bit signed			/// values contained in the first source operand and packed 8-bit signed
	/// integer values contained in the second source operand, adds pairs of			/// integer values contained in the second source operand, adds pairs of
	/// contiguous products with signed saturation, and writes the 16-bit sums to			/// contiguous products with signed saturation, and writes the 16-bit sums to
	/// the corresponding bits in the destination.			/// the corresponding bits in the destination.
	///			///
	[44 lines not shown]
	/// \param __b			/// \param __b
	/// A 64-bit integer vector containing the second source operand.			/// A 64-bit integer vector containing the second source operand.
	/// \returns A 64-bit integer vector containing the sums of products of both			/// \returns A 64-bit integer vector containing the sums of products of both
	/// operands: \n			/// operands: \n
	/// \a R0 := (\a __a0 * \a __b0) + (\a __a1 * \a __b1) \n			/// \a R0 := (\a __a0 * \a __b0) + (\a __a1 * \a __b1) \n
	/// \a R1 := (\a __a2 * \a __b2) + (\a __a3 * \a __b3) \n			/// \a R1 := (\a __a2 * \a __b2) + (\a __a3 * \a __b3) \n
	/// \a R2 := (\a __a4 * \a __b4) + (\a __a5 * \a __b5) \n			/// \a R2 := (\a __a4 * \a __b4) + (\a __a5 * \a __b5) \n
	/// \a R3 := (\a __a6 * \a __b6) + (\a __a7 * \a __b7)			/// \a R3 := (\a __a6 * \a __b6) + (\a __a7 * \a __b7)
	static __inline__ __m64 __DEFAULT_FN_ATTRS_MMX			static __inline__ __m64 __DEFAULT_FN_ATTRS
	_mm_maddubs_pi16(__m64 __a, __m64 __b)			_mm_maddubs_pi16(__m64 __a, __m64 __b)
	{			{
	return (__m64)__builtin_ia32_pmaddubsw((__v8qi)__a, (__v8qi)__b);			return __trunc64(__builtin_ia32_pmaddubsw128((__v16qi)__anyext128(__a),
				(__v16qi)__anyext128(__b)));
	}			}
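
The per-lane arithmetic described by the `R0 := ...` formulas above can be written out as a scalar sketch. `model_maddubs_lane` is a hypothetical helper that computes one 16-bit result from two adjacent byte pairs, treating the first operand's bytes as unsigned and the second operand's bytes as signed, with signed saturation of the sum.

```c
#include <stdint.h>

/* Hypothetical model of one output lane of the multiply-add:
   (a0 * b0) + (a1 * b1), saturated to the signed 16-bit range. */
static int16_t model_maddubs_lane(uint8_t a0, int8_t b0,
                                  uint8_t a1, int8_t b1)
{
    int32_t s = (int32_t)a0 * b0 + (int32_t)a1 * b1;
    if (s > INT16_MAX)
        s = INT16_MAX;
    if (s < INT16_MIN)
        s = INT16_MIN;
    return (int16_t)s;
}
```

Each output lane depends only on the two matching byte pairs, which is why running the 128-bit builtin on widened operands and truncating gives the same low 64 bits.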

	/// Multiplies packed 16-bit signed integer values, truncates the 32-bit			/// Multiplies packed 16-bit signed integer values, truncates the 32-bit
	/// products to the 18 most significant bits by right-shifting, rounds the			/// products to the 18 most significant bits by right-shifting, rounds the
	/// truncated value by adding 1, and writes bits [16:1] to the destination.			/// truncated value by adding 1, and writes bits [16:1] to the destination.
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	[20 lines not shown]
	/// This intrinsic corresponds to the \c PMULHRSW instruction.			/// This intrinsic corresponds to the \c PMULHRSW instruction.
	///			///
	/// \param __a			/// \param __a
	/// A 64-bit vector of [4 x i16] containing one of the source operands.			/// A 64-bit vector of [4 x i16] containing one of the source operands.
	/// \param __b			/// \param __b
	/// A 64-bit vector of [4 x i16] containing one of the source operands.			/// A 64-bit vector of [4 x i16] containing one of the source operands.
	/// \returns A 64-bit vector of [4 x i16] containing the rounded and scaled			/// \returns A 64-bit vector of [4 x i16] containing the rounded and scaled
	/// products of both operands.			/// products of both operands.
	static __inline__ __m64 __DEFAULT_FN_ATTRS_MMX			static __inline__ __m64 __DEFAULT_FN_ATTRS
	_mm_mulhrs_pi16(__m64 __a, __m64 __b)			_mm_mulhrs_pi16(__m64 __a, __m64 __b)
	{			{
	return (__m64)__builtin_ia32_pmulhrsw((__v4hi)__a, (__v4hi)__b);			return __trunc64(__builtin_ia32_pmulhrsw128((__v8hi)__anyext128(__a),
				(__v8hi)__anyext128(__b)));
	}			}
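The rounding rule described in the comment for _mm_mulhrs_pi16 can be checked against a scalar model. The following is a minimal sketch in portable C, not code from the patch; the names mulhrs_lane and mulhrs_pi16_model are hypothetical, and it assumes that right-shifting a negative int is an arithmetic shift (implementation-defined in C, but the usual behaviour on x86 compilers).

```c
#include <stdint.h>

/* Scalar model of one PMULHRSW lane. Illustration only, not patch code.
 * Assumption: >> on a negative int32_t is an arithmetic shift. */
static int16_t mulhrs_lane(int16_t a, int16_t b)
{
    int32_t product = (int32_t)a * (int32_t)b; /* full 32-bit signed product */
    int32_t top18 = product >> 14;             /* keep the 18 most significant bits */
    return (int16_t)((top18 + 1) >> 1);        /* add 1 to round, keep bits [16:1] */
}

/* Apply the lane model to a 64-bit vector of [4 x i16]. */
static void mulhrs_pi16_model(const int16_t a[4], const int16_t b[4],
                              int16_t out[4])
{
    for (int i = 0; i < 4; ++i)
        out[i] = mulhrs_lane(a[i], b[i]);
}
```

Read as Q15 fixed point, 0x4000 is 0.5, so multiplying it by itself should give 0x2000 (0.25), which is what the model returns.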

	/// Copies the 8-bit integers from a 128-bit integer vector to the			/// Copies the 8-bit integers from a 128-bit integer vector to the
	/// destination or clears 8-bit values in the destination, as specified by			/// destination or clears 8-bit values in the destination, as specified by
	/// the second source operand.			/// the second source operand.
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	[29 lines not shown]
	/// A 64-bit integer vector containing the values to be copied.			/// A 64-bit integer vector containing the values to be copied.
	/// \param __b			/// \param __b
	/// A 64-bit integer vector containing control bytes corresponding to			/// A 64-bit integer vector containing control bytes corresponding to
	/// positions in the destination:			/// positions in the destination:
	/// Bit 7: \n			/// Bit 7: \n
	/// 1: Clear the corresponding byte in the destination. \n			/// 1: Clear the corresponding byte in the destination. \n
	/// 0: Copy the selected source byte to the corresponding byte in the			/// 0: Copy the selected source byte to the corresponding byte in the
	/// destination. \n			/// destination. \n
	/// Bits [3:0] select the source byte to be copied.			/// Bits [2:0] select the source byte to be copied.
	/// \returns A 64-bit integer vector containing the copied or cleared values.			/// \returns A 64-bit integer vector containing the copied or cleared values.
	static __inline__ __m64 __DEFAULT_FN_ATTRS_MMX			static __inline__ __m64 __DEFAULT_FN_ATTRS
	_mm_shuffle_pi8(__m64 __a, __m64 __b)			_mm_shuffle_pi8(__m64 __a, __m64 __b)
	{			{
	return (__m64)__builtin_ia32_pshufb((__v8qi)__a, (__v8qi)__b);			return __trunc64(__builtin_ia32_pshufb128(
				(__v16qi)__builtin_shufflevector(
				(__v2si)(__a), __extension__ (__v2si){}, 0, 1, 0, 1),
				(__v16qi)__anyext128(__b)));
	}			}
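The new body of _mm_shuffle_pi8 duplicates the low 64 bits of __a into both halves of the 128-bit source (shuffle indices 0, 1, 0, 1) instead of using __anyext128. A plausible reading is that the 128-bit PSHUFB selects with bits [3:0] of each control byte, while the 64-bit form only uses bits [2:0]; mirroring the source makes bit 3 irrelevant. The sketch below models that in portable C. It is an illustration under that assumption, and the function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* 64-bit PSHUFB as documented above: bit 7 clears, bits [2:0] select. */
static void pshufb64_model(const uint8_t src[8], const uint8_t ctl[8],
                           uint8_t out[8])
{
    for (int i = 0; i < 8; ++i)
        out[i] = (ctl[i] & 0x80) ? 0 : src[ctl[i] & 0x07];
}

/* 128-bit PSHUFB: bit 7 clears, bits [3:0] select among 16 bytes. */
static void pshufb128_model(const uint8_t src[16], const uint8_t ctl[16],
                            uint8_t out[16])
{
    for (int i = 0; i < 16; ++i)
        out[i] = (ctl[i] & 0x80) ? 0 : src[ctl[i] & 0x0f];
}

/* Emulate the 64-bit form with the 128-bit one. The source is copied into
 * both halves so that bit 3 of a control byte cannot change the result. */
static void pshufb64_via_128(const uint8_t src[8], const uint8_t ctl[8],
                             uint8_t out[8])
{
    uint8_t wide_src[16], wide_ctl[16] = {0}, wide_out[16];
    memcpy(wide_src, src, 8);
    memcpy(wide_src + 8, src, 8); /* upper half mirrors the lower half */
    memcpy(wide_ctl, ctl, 8);     /* upper control bytes only affect discarded lanes */
    pshufb128_model(wide_src, wide_ctl, wide_out);
    memcpy(out, wide_out, 8);     /* truncate back to 64 bits */
}
```

With control bytes such as 8 or 15, the two models agree only because of the mirrored source; an any-extended source would leave those lanes unspecified.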

	/// For each 8-bit integer in the first source operand, perform one of			/// For each 8-bit integer in the first source operand, perform one of
	/// the following actions as specified by the second source operand.			/// the following actions as specified by the second source operand.
	///			///
	/// If the byte in the second source is negative, calculate the two's			/// If the byte in the second source is negative, calculate the two's
	/// complement of the corresponding byte in the first source, and write that			/// complement of the corresponding byte in the first source, and write that
	/// value to the destination. If the byte in the second source is positive,			/// value to the destination. If the byte in the second source is positive,
	[84 lines not shown]
	/// This intrinsic corresponds to the \c PSIGNB instruction.			/// This intrinsic corresponds to the \c PSIGNB instruction.
	///			///
	/// \param __a			/// \param __a
	/// A 64-bit integer vector containing the values to be copied.			/// A 64-bit integer vector containing the values to be copied.
	/// \param __b			/// \param __b
	/// A 64-bit integer vector containing control bytes corresponding to			/// A 64-bit integer vector containing control bytes corresponding to
	/// positions in the destination.			/// positions in the destination.
	/// \returns A 64-bit integer vector containing the resultant values.			/// \returns A 64-bit integer vector containing the resultant values.
	static __inline__ __m64 __DEFAULT_FN_ATTRS_MMX			static __inline__ __m64 __DEFAULT_FN_ATTRS
	_mm_sign_pi8(__m64 __a, __m64 __b)			_mm_sign_pi8(__m64 __a, __m64 __b)
	{			{
	return (__m64)__builtin_ia32_psignb((__v8qi)__a, (__v8qi)__b);			return __trunc64(__builtin_ia32_psignb128((__v16qi)__anyext128(__a),
				(__v16qi)__anyext128(__b)));
	}			}

	/// For each 16-bit integer in the first source operand, perform one of			/// For each 16-bit integer in the first source operand, perform one of
	/// the following actions as specified by the second source operand.			/// the following actions as specified by the second source operand.
	///			///
	/// If the word in the second source is negative, calculate the two's			/// If the word in the second source is negative, calculate the two's
	/// complement of the corresponding word in the first source, and write that			/// complement of the corresponding word in the first source, and write that
	/// value to the destination. If the word in the second source is positive,			/// value to the destination. If the word in the second source is positive,
	/// copy the corresponding word from the first source to the destination. If			/// copy the corresponding word from the first source to the destination. If
	/// the word in the second source is zero, clear the corresponding word in			/// the word in the second source is zero, clear the corresponding word in
	/// the destination.			/// the destination.
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the \c PSIGNW instruction.			/// This intrinsic corresponds to the \c PSIGNW instruction.
	///			///
	/// \param __a			/// \param __a
	/// A 64-bit integer vector containing the values to be copied.			/// A 64-bit integer vector containing the values to be copied.
	/// \param __b			/// \param __b
	/// A 64-bit integer vector containing control words corresponding to			/// A 64-bit integer vector containing control words corresponding to
	/// positions in the destination.			/// positions in the destination.
	/// \returns A 64-bit integer vector containing the resultant values.			/// \returns A 64-bit integer vector containing the resultant values.
	static __inline__ __m64 __DEFAULT_FN_ATTRS_MMX			static __inline__ __m64 __DEFAULT_FN_ATTRS
	_mm_sign_pi16(__m64 __a, __m64 __b)			_mm_sign_pi16(__m64 __a, __m64 __b)
	{			{
	return (__m64)__builtin_ia32_psignw((__v4hi)__a, (__v4hi)__b);			return __trunc64(__builtin_ia32_psignw128((__v8hi)__anyext128(__a),
				(__v8hi)__anyext128(__b)));
	}			}

	/// For each 32-bit integer in the first source operand, perform one of			/// For each 32-bit integer in the first source operand, perform one of
	/// the following actions as specified by the second source operand.			/// the following actions as specified by the second source operand.
	///			///
	/// If the doubleword in the second source is negative, calculate the two's			/// If the doubleword in the second source is negative, calculate the two's
	/// complement of the corresponding doubleword in the first source, and			/// complement of the corresponding doubleword in the first source, and
	/// write that value to the destination. If the doubleword in the second			/// write that value to the destination. If the doubleword in the second
	/// source is positive, copy the corresponding doubleword from the first			/// source is positive, copy the corresponding doubleword from the first
	/// source to the destination. If the doubleword in the second source is			/// source to the destination. If the doubleword in the second source is
	/// zero, clear the corresponding doubleword in the destination.			/// zero, clear the corresponding doubleword in the destination.
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the \c PSIGND instruction.			/// This intrinsic corresponds to the \c PSIGND instruction.
	///			///
	/// \param __a			/// \param __a
	/// A 64-bit integer vector containing the values to be copied.			/// A 64-bit integer vector containing the values to be copied.
	/// \param __b			/// \param __b
	/// A 64-bit integer vector containing two control doublewords corresponding			/// A 64-bit integer vector containing two control doublewords corresponding
	/// to positions in the destination.			/// to positions in the destination.
	/// \returns A 64-bit integer vector containing the resultant values.			/// \returns A 64-bit integer vector containing the resultant values.
	static __inline__ __m64 __DEFAULT_FN_ATTRS_MMX			static __inline__ __m64 __DEFAULT_FN_ATTRS
	_mm_sign_pi32(__m64 __a, __m64 __b)			_mm_sign_pi32(__m64 __a, __m64 __b)
	{			{
	return (__m64)__builtin_ia32_psignd((__v2si)__a, (__v2si)__b);			return __trunc64(__builtin_ia32_psignd128((__v4si)__anyext128(__a),
				(__v4si)__anyext128(__b)));
	}			}

				#undef __extract2_32
				#undef __anyext128
				#undef __trunc64
	#undef __DEFAULT_FN_ATTRS			#undef __DEFAULT_FN_ATTRS
	#undef __DEFAULT_FN_ATTRS_MMX

	#endif /* __TMMINTRIN_H */			#endif /* __TMMINTRIN_H */

clang/lib/Headers/xmmintrin.h

	[23 lines not shown]
	/* This header should only be included in a hosted environment as it depends on			/* This header should only be included in a hosted environment as it depends on
	* a standard library to provide allocation routines. */			* a standard library to provide allocation routines. */
	#if __STDC_HOSTED__			#if __STDC_HOSTED__
	#include <mm_malloc.h>			#include <mm_malloc.h>
	#endif			#endif

	/* Define the default attributes for the functions in this file. */			/* Define the default attributes for the functions in this file. */
	#define __DEFAULT_FN_ATTRS __attribute__((__always_inline__, __nodebug__, __target__("sse"), __min_vector_width__(128)))			#define __DEFAULT_FN_ATTRS __attribute__((__always_inline__, __nodebug__, __target__("sse"), __min_vector_width__(128)))
	#define __DEFAULT_FN_ATTRS_MMX __attribute__((__always_inline__, __nodebug__, __target__("mmx,sse"), __min_vector_width__(64)))			#define __DEFAULT_FN_ATTRS_SSE2 __attribute__((__always_inline__, __nodebug__, __target__("sse2"), __min_vector_width__(64)))

				#define __trunc64(x) (__m64)__builtin_shufflevector((__v2di)(x), __extension__ (__v2di){}, 0)
				#define __zext128(x) (__m128i)__builtin_shufflevector((__v2si)(x), __extension__ (__v2si){}, 0, 1, 2, 3)
				#define __anyext128(x) (__m128i)__builtin_shufflevector((__v2si)(x), __extension__ (__v2si){}, 0, 1, -1, -1)
				#define __zeroupper64(x) (__m128i)__builtin_shufflevector((__v4si)(x), __extension__ (__v4si){}, 0, 1, 4, 5)

	/// Adds the 32-bit float values in the low-order bits of the operands.			/// Adds the 32-bit float values in the low-order bits of the operands.
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> VADDSS / ADDSS </c> instructions.			/// This intrinsic corresponds to the <c> VADDSS / ADDSS </c> instructions.
	///			///
	/// \param __a			/// \param __a
	[1,308 lines not shown]
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> CVTPS2PI </c> instruction.			/// This intrinsic corresponds to the <c> CVTPS2PI </c> instruction.
	///			///
	/// \param __a			/// \param __a
	/// A 128-bit vector of [4 x float].			/// A 128-bit vector of [4 x float].
	/// \returns A 64-bit integer vector containing the converted values.			/// \returns A 64-bit integer vector containing the converted values.
	static __inline__ __m64 __DEFAULT_FN_ATTRS_MMX			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_cvtps_pi32(__m128 __a)			_mm_cvtps_pi32(__m128 __a)
	{			{
	return (__m64)__builtin_ia32_cvtps2pi((__v4sf)__a);			return __trunc64(__builtin_ia32_cvtps2dq((__v4sf)__zeroupper64(__a)));
	}			}

	/// Converts two low-order float values in a 128-bit vector of			/// Converts two low-order float values in a 128-bit vector of
	/// [4 x float] into a 64-bit vector of [2 x i32].			/// [4 x float] into a 64-bit vector of [2 x i32].
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> CVTPS2PI </c> instruction.			/// This intrinsic corresponds to the <c> CVTPS2PI </c> instruction.
	///			///
	/// \param __a			/// \param __a
	/// A 128-bit vector of [4 x float].			/// A 128-bit vector of [4 x float].
	/// \returns A 64-bit integer vector containing the converted values.			/// \returns A 64-bit integer vector containing the converted values.
	static __inline__ __m64 __DEFAULT_FN_ATTRS_MMX			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_cvt_ps2pi(__m128 __a)			_mm_cvt_ps2pi(__m128 __a)
	{			{
	return _mm_cvtps_pi32(__a);			return _mm_cvtps_pi32(__a);
	}			}

	/// Converts a float value contained in the lower 32 bits of a vector of			/// Converts a float value contained in the lower 32 bits of a vector of
	/// [4 x float] into a 32-bit integer, truncating the result when it is			/// [4 x float] into a 32-bit integer, truncating the result when it is
	/// inexact.			/// inexact.
	[60 lines not shown]
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> CVTTPS2PI / VTTPS2PI </c>			/// This intrinsic corresponds to the <c> CVTTPS2PI / VTTPS2PI </c>
	/// instructions.			/// instructions.
	///			///
	/// \param __a			/// \param __a
	/// A 128-bit vector of [4 x float].			/// A 128-bit vector of [4 x float].
	/// \returns A 64-bit integer vector containing the converted values.			/// \returns A 64-bit integer vector containing the converted values.
	static __inline__ __m64 __DEFAULT_FN_ATTRS_MMX			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_cvttps_pi32(__m128 __a)			_mm_cvttps_pi32(__m128 __a)
	{			{
	return (__m64)__builtin_ia32_cvttps2pi((__v4sf)__a);			return __trunc64(__builtin_ia32_cvttps2dq((__v4sf)__zeroupper64(__a)));
	}			}

	/// Converts two low-order float values in a 128-bit vector of [4 x			/// Converts two low-order float values in a 128-bit vector of [4 x
	/// float] into a 64-bit vector of [2 x i32], truncating the result when it			/// float] into a 64-bit vector of [2 x i32], truncating the result when it
	/// is inexact.			/// is inexact.
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> CVTTPS2PI </c> instruction.			/// This intrinsic corresponds to the <c> CVTTPS2PI </c> instruction.
	///			///
	/// \param __a			/// \param __a
	/// A 128-bit vector of [4 x float].			/// A 128-bit vector of [4 x float].
	/// \returns A 64-bit integer vector containing the converted values.			/// \returns A 64-bit integer vector containing the converted values.
	static __inline__ __m64 __DEFAULT_FN_ATTRS_MMX			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_cvtt_ps2pi(__m128 __a)			_mm_cvtt_ps2pi(__m128 __a)
	{			{
	return _mm_cvttps_pi32(__a);			return _mm_cvttps_pi32(__a);
	}			}

	/// Converts a 32-bit signed integer value into a floating point value			/// Converts a 32-bit signed integer value into a floating point value
	/// and writes it to the lower 32 bits of the destination. The remaining			/// and writes it to the lower 32 bits of the destination. The remaining
	/// higher order elements of the destination vector are copied from the			/// higher order elements of the destination vector are copied from the
	[78 lines not shown]
	/// \param __a			/// \param __a
	/// A 128-bit vector of [4 x float].			/// A 128-bit vector of [4 x float].
	/// \param __b			/// \param __b
	/// A 64-bit vector of [2 x i32]. The elements in this vector are converted			/// A 64-bit vector of [2 x i32]. The elements in this vector are converted
	/// and written to the corresponding low-order elements in the destination.			/// and written to the corresponding low-order elements in the destination.
	/// \returns A 128-bit vector of [4 x float] whose lower 64 bits contain the			/// \returns A 128-bit vector of [4 x float] whose lower 64 bits contain the
	/// converted value of the second operand. The upper 64 bits are copied from			/// converted value of the second operand. The upper 64 bits are copied from
	/// the upper 64 bits of the first operand.			/// the upper 64 bits of the first operand.
	static __inline__ __m128 __DEFAULT_FN_ATTRS_MMX			static __inline__ __m128 __DEFAULT_FN_ATTRS_SSE2
	_mm_cvtpi32_ps(__m128 __a, __m64 __b)			_mm_cvtpi32_ps(__m128 __a, __m64 __b)
	{			{
	return __builtin_ia32_cvtpi2ps((__v4sf)__a, (__v2si)__b);			return (__m128)__builtin_shufflevector(
				(__v4sf)__a,
				__builtin_convertvector((__v4si)__zext128(__b), __v4sf),
				4, 5, 2, 3);
	}			}
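The replacement body of _mm_cvtpi32_ps converts all four lanes of the zero-extended __b and then uses shuffle indices 4, 5, 2, 3 to merge the result with __a. As a rough sketch of what those indices select, here is a plain-C model; cvtpi32_ps_model is a hypothetical name, and the arrays stand in for the vector types.

```c
#include <stdint.h>

/* Plain-C model of the new _mm_cvtpi32_ps body. Illustration only. */
static void cvtpi32_ps_model(const float a[4], const int32_t b[2],
                             float out[4])
{
    /* Widen b to four lanes with zeroed upper lanes, then convert each lane. */
    int32_t wide[4] = { b[0], b[1], 0, 0 };
    float conv[4];
    for (int i = 0; i < 4; ++i)
        conv[i] = (float)wide[i];

    /* Shuffle indices address the concatenation {a, conv}:
     * 4 and 5 pick conv[0] and conv[1]; 2 and 3 pick a[2] and a[3]. */
    out[0] = conv[0];
    out[1] = conv[1];
    out[2] = a[2];
    out[3] = a[3];
}
```

The lower 64 bits of the result hold the converted integers and the upper 64 bits come from the first operand, matching the \returns text above.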

	/// Converts two elements of a 64-bit vector of [2 x i32] into two			/// Converts two elements of a 64-bit vector of [2 x i32] into two
	/// floating point values and writes them to the lower 64-bits of the			/// floating point values and writes them to the lower 64-bits of the
	/// destination. The remaining higher order elements of the destination are			/// destination. The remaining higher order elements of the destination are
	/// copied from the corresponding elements in the first operand.			/// copied from the corresponding elements in the first operand.
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> CVTPI2PS </c> instruction.			/// This intrinsic corresponds to the <c> CVTPI2PS </c> instruction.
	///			///
	/// \param __a			/// \param __a
	/// A 128-bit vector of [4 x float].			/// A 128-bit vector of [4 x float].
	/// \param __b			/// \param __b
	/// A 64-bit vector of [2 x i32]. The elements in this vector are converted			/// A 64-bit vector of [2 x i32]. The elements in this vector are converted
	/// and written to the corresponding low-order elements in the destination.			/// and written to the corresponding low-order elements in the destination.
	/// \returns A 128-bit vector of [4 x float] whose lower 64 bits contain the			/// \returns A 128-bit vector of [4 x float] whose lower 64 bits contain the
	/// converted value from the second operand. The upper 64 bits are copied			/// converted value from the second operand. The upper 64 bits are copied
	/// from the upper 64 bits of the first operand.			/// from the upper 64 bits of the first operand.
	static __inline__ __m128 __DEFAULT_FN_ATTRS_MMX			static __inline__ __m128 __DEFAULT_FN_ATTRS_SSE2
	_mm_cvt_pi2ps(__m128 __a, __m64 __b)			_mm_cvt_pi2ps(__m128 __a, __m64 __b)
	{			{
	return _mm_cvtpi32_ps(__a, __b);			return _mm_cvtpi32_ps(__a, __b);
	}			}

	/// Extracts a float value contained in the lower 32 bits of a vector of			/// Extracts a float value contained in the lower 32 bits of a vector of
	/// [4 x float].			/// [4 x float].
	///			///
	[517 lines not shown]
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> MOVNTQ </c> instruction.			/// This intrinsic corresponds to the <c> MOVNTQ </c> instruction.
	///			///
	/// \param __p			/// \param __p
	/// A pointer to an aligned memory location used to store the register value.			/// A pointer to an aligned memory location used to store the register value.
	/// \param __a			/// \param __a
	/// A 64-bit integer containing the value to be stored.			/// A 64-bit integer containing the value to be stored.
	static __inline__ void __DEFAULT_FN_ATTRS_MMX			static __inline__ void __DEFAULT_FN_ATTRS
	_mm_stream_pi(__m64 *__p, __m64 __a)			_mm_stream_pi(__m64 *__p, __m64 __a)
	{			{
	__builtin_ia32_movntq(__p, __a);			__builtin_nontemporal_store(__a, __p);
	}			}

	/// Moves packed float values from a 128-bit vector of [4 x float] to a			/// Moves packed float values from a 128-bit vector of [4 x float] to a
	/// 128-bit aligned memory location. To minimize caching, the data is flagged			/// 128-bit aligned memory location. To minimize caching, the data is flagged
	/// as non-temporal (unlikely to be used again soon).			/// as non-temporal (unlikely to be used again soon).
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	[45 lines not shown]
	/// \param n			/// \param n
	/// An immediate integer operand that determines which bits are extracted: \n			/// An immediate integer operand that determines which bits are extracted: \n
	/// 0: Bits [15:0] are copied to the destination. \n			/// 0: Bits [15:0] are copied to the destination. \n
	/// 1: Bits [31:16] are copied to the destination. \n			/// 1: Bits [31:16] are copied to the destination. \n
	/// 2: Bits [47:32] are copied to the destination. \n			/// 2: Bits [47:32] are copied to the destination. \n
	/// 3: Bits [63:48] are copied to the destination.			/// 3: Bits [63:48] are copied to the destination.
	/// \returns A 16-bit integer containing the extracted 16 bits of packed data.			/// \returns A 16-bit integer containing the extracted 16 bits of packed data.
	#define _mm_extract_pi16(a, n) \			#define _mm_extract_pi16(a, n) \
	(int)__builtin_ia32_vec_ext_v4hi((__v4hi)a, (int)n)			(int)(unsigned short)__builtin_ia32_vec_ext_v4hi((__v4hi)a, (int)n)
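The added (unsigned short) cast in _mm_extract_pi16 forces the extracted element to be zero-extended to int, which is how PEXTRW writes its destination register. The sketch below shows the difference between the two casts on a signed 16-bit element. It assumes the extracted value is a signed 16-bit quantity before the cast, and the helper names are hypothetical.

```c
#include <stdint.h>

/* Sign-extending extraction: what a bare (int) cast of an i16 element gives. */
static int extract_sign_extended(const int16_t v[4], int n)
{
    return (int)v[n & 3];
}

/* Zero-extending extraction: mirrors the added (unsigned short) cast. */
static int extract_zero_extended(const int16_t v[4], int n)
{
    return (int)(unsigned short)v[n & 3];
}
```

For an element holding 0xFFFF, the first helper yields -1 and the second yields 65535; elements with the top bit clear are unaffected.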

	/// Copies data from the 64-bit vector of [4 x i16] to the destination,			/// Copies data from the 64-bit vector of [4 x i16] to the destination,
	/// and inserts the lower 16-bits of an integer operand at the 16-bit offset			/// and inserts the lower 16-bits of an integer operand at the 16-bit offset
	/// specified by the immediate operand \a n.			/// specified by the immediate operand \a n.
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// \code			/// \code
	[29 lines not shown]
	///			///
	/// This intrinsic corresponds to the <c> PMAXSW </c> instruction.			/// This intrinsic corresponds to the <c> PMAXSW </c> instruction.
	///			///
	/// \param __a			/// \param __a
	/// A 64-bit integer vector containing one of the source operands.			/// A 64-bit integer vector containing one of the source operands.
	/// \param __b			/// \param __b
	/// A 64-bit integer vector containing one of the source operands.			/// A 64-bit integer vector containing one of the source operands.
	/// \returns A 64-bit integer vector containing the comparison results.			/// \returns A 64-bit integer vector containing the comparison results.
	static __inline__ __m64 __DEFAULT_FN_ATTRS_MMX			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_max_pi16(__m64 __a, __m64 __b)			_mm_max_pi16(__m64 __a, __m64 __b)
	{			{
	return (__m64)__builtin_ia32_pmaxsw((__v4hi)__a, (__v4hi)__b);			return __trunc64(__builtin_ia32_pmaxsw128((__v8hi)__anyext128(__a),
				(__v8hi)__anyext128(__b)));
	}			}

	/// Compares each of the corresponding packed 8-bit unsigned integer			/// Compares each of the corresponding packed 8-bit unsigned integer
	/// values of the 64-bit integer vectors, and writes the greater value to the			/// values of the 64-bit integer vectors, and writes the greater value to the
	/// corresponding bits in the destination.			/// corresponding bits in the destination.
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> PMAXUB </c> instruction.			/// This intrinsic corresponds to the <c> PMAXUB </c> instruction.
	///			///
	/// \param __a			/// \param __a
	/// A 64-bit integer vector containing one of the source operands.			/// A 64-bit integer vector containing one of the source operands.
	/// \param __b			/// \param __b
	/// A 64-bit integer vector containing one of the source operands.			/// A 64-bit integer vector containing one of the source operands.
	/// \returns A 64-bit integer vector containing the comparison results.			/// \returns A 64-bit integer vector containing the comparison results.
	static __inline__ __m64 __DEFAULT_FN_ATTRS_MMX			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_max_pu8(__m64 __a, __m64 __b)			_mm_max_pu8(__m64 __a, __m64 __b)
	{			{
	return (__m64)__builtin_ia32_pmaxub((__v8qi)__a, (__v8qi)__b);			return __trunc64(__builtin_ia32_pmaxub128((__v16qi)__anyext128(__a),
				(__v16qi)__anyext128(__b)));
	}			}

	/// Compares each of the corresponding packed 16-bit integer values of			/// Compares each of the corresponding packed 16-bit integer values of
	/// the 64-bit integer vectors, and writes the lesser value to the			/// the 64-bit integer vectors, and writes the lesser value to the
	/// corresponding bits in the destination.			/// corresponding bits in the destination.
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> PMINSW </c> instruction.			/// This intrinsic corresponds to the <c> PMINSW </c> instruction.
	///			///
	/// \param __a			/// \param __a
	/// A 64-bit integer vector containing one of the source operands.			/// A 64-bit integer vector containing one of the source operands.
	/// \param __b			/// \param __b
	/// A 64-bit integer vector containing one of the source operands.			/// A 64-bit integer vector containing one of the source operands.
	/// \returns A 64-bit integer vector containing the comparison results.			/// \returns A 64-bit integer vector containing the comparison results.
	static __inline__ __m64 __DEFAULT_FN_ATTRS_MMX			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_min_pi16(__m64 __a, __m64 __b)			_mm_min_pi16(__m64 __a, __m64 __b)
	{			{
	return (__m64)__builtin_ia32_pminsw((__v4hi)__a, (__v4hi)__b);			return __trunc64(__builtin_ia32_pminsw128((__v8hi)__anyext128(__a),
				(__v8hi)__anyext128(__b)));
	}			}

	/// Compares each of the corresponding packed 8-bit unsigned integer			/// Compares each of the corresponding packed 8-bit unsigned integer
	/// values of the 64-bit integer vectors, and writes the lesser value to the			/// values of the 64-bit integer vectors, and writes the lesser value to the
	/// corresponding bits in the destination.			/// corresponding bits in the destination.
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> PMINUB </c> instruction.			/// This intrinsic corresponds to the <c> PMINUB </c> instruction.
	///			///
	/// \param __a			/// \param __a
	/// A 64-bit integer vector containing one of the source operands.			/// A 64-bit integer vector containing one of the source operands.
	/// \param __b			/// \param __b
	/// A 64-bit integer vector containing one of the source operands.			/// A 64-bit integer vector containing one of the source operands.
	/// \returns A 64-bit integer vector containing the comparison results.			/// \returns A 64-bit integer vector containing the comparison results.
	static __inline__ __m64 __DEFAULT_FN_ATTRS_MMX			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_min_pu8(__m64 __a, __m64 __b)			_mm_min_pu8(__m64 __a, __m64 __b)
	{			{
	return (__m64)__builtin_ia32_pminub((__v8qi)__a, (__v8qi)__b);			return __trunc64(__builtin_ia32_pminub128((__v16qi)__anyext128(__a),
				(__v16qi)__anyext128(__b)));
	}			}

	/// Takes the most significant bit from each 8-bit element in a 64-bit			/// Takes the most significant bit from each 8-bit element in a 64-bit
	/// integer vector to create an 8-bit mask value. Zero-extends the value to			/// integer vector to create an 8-bit mask value. Zero-extends the value to
	/// 32-bit integer and writes it to the destination.			/// 32-bit integer and writes it to the destination.
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> PMOVMSKB </c> instruction.			/// This intrinsic corresponds to the <c> PMOVMSKB </c> instruction.
	///			///
	/// \param __a			/// \param __a
	/// A 64-bit integer vector containing the values with bits to be extracted.			/// A 64-bit integer vector containing the values with bits to be extracted.
	/// \returns The most significant bit from each 8-bit element in \a __a,			/// \returns The most significant bit from each 8-bit element in \a __a,
	/// written to bits [7:0].			/// written to bits [7:0].
	static __inline__ int __DEFAULT_FN_ATTRS_MMX			static __inline__ int __DEFAULT_FN_ATTRS_SSE2
	_mm_movemask_pi8(__m64 __a)			_mm_movemask_pi8(__m64 __a)
	{			{
	return __builtin_ia32_pmovmskb((__v8qi)__a);			return __builtin_ia32_pmovmskb128((__v16qi)__zext128(__a));
				craig.topper: This doesn't guarantee zeroes in bits 15:8, does it?
				jyknight: It does not. Switched to zext128.
	}			}
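The exchange above about bits 15:8 can be illustrated with a scalar model. A 128-bit PMOVMSKB gathers sixteen sign bits, so whatever sits in the upper eight bytes shows up in bits [15:8] of the result. The sketch below is an assumption-laden illustration, not patch code: upper_fill is a hypothetical parameter standing in for the upper bytes, with 0 modelling __zext128 and any other value modelling the unspecified lanes that __anyext128 would allow.

```c
#include <stdint.h>

/* Model of a 128-bit PMOVMSKB: gather the top bit of each of 16 bytes. */
static int pmovmskb128_model(const uint8_t v[16])
{
    int mask = 0;
    for (int i = 0; i < 16; ++i)
        mask |= ((v[i] >> 7) & 1) << i;
    return mask;
}

/* 64-bit movemask computed through the 128-bit model. upper_fill is the
 * byte value placed in lanes 8..15 before the mask is taken. */
static int movemask_pi8_model(const uint8_t v[8], uint8_t upper_fill)
{
    uint8_t wide[16];
    for (int i = 0; i < 8; ++i) {
        wide[i] = v[i];
        wide[i + 8] = upper_fill;
    }
    return pmovmskb128_model(wide);
}
```

With a zero fill the result stays within bits [7:0], as the documentation for _mm_movemask_pi8 promises; with a nonzero fill the upper bits can be set.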

	/// Multiplies packed 16-bit unsigned integer values and writes the			/// Multiplies packed 16-bit unsigned integer values and writes the
	/// high-order 16 bits of each 32-bit product to the corresponding bits in			/// high-order 16 bits of each 32-bit product to the corresponding bits in
	/// the destination.			/// the destination.
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> PMULHUW </c> instruction.			/// This intrinsic corresponds to the <c> PMULHUW </c> instruction.
	///			///
	/// \param __a			/// \param __a
	/// A 64-bit integer vector containing one of the source operands.			/// A 64-bit integer vector containing one of the source operands.
	/// \param __b			/// \param __b
	/// A 64-bit integer vector containing one of the source operands.			/// A 64-bit integer vector containing one of the source operands.
	/// \returns A 64-bit integer vector containing the products of both operands.			/// \returns A 64-bit integer vector containing the products of both operands.
	static __inline__ __m64 __DEFAULT_FN_ATTRS_MMX			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_mulhi_pu16(__m64 __a, __m64 __b)			_mm_mulhi_pu16(__m64 __a, __m64 __b)
	{			{
	return (__m64)__builtin_ia32_pmulhuw((__v4hi)__a, (__v4hi)__b);			return __trunc64(__builtin_ia32_pmulhuw128((__v8hi)__anyext128(__a),
				(__v8hi)__anyext128(__b)));
	}			}

	/// Shuffles the 4 16-bit integers from a 64-bit integer vector to the			/// Shuffles the 4 16-bit integers from a 64-bit integer vector to the
	/// destination, as specified by the immediate value operand.			/// destination, as specified by the immediate value operand.
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// \code			/// \code
	[18 lines not shown]
	/// destination. \n			/// destination. \n
	/// Bit value assignments: \n			/// Bit value assignments: \n
	/// 00: assigned from bits [15:0] of \a a. \n			/// 00: assigned from bits [15:0] of \a a. \n
	/// 01: assigned from bits [31:16] of \a a. \n			/// 01: assigned from bits [31:16] of \a a. \n
	/// 10: assigned from bits [47:32] of \a a. \n			/// 10: assigned from bits [47:32] of \a a. \n
	/// 11: assigned from bits [63:48] of \a a.			/// 11: assigned from bits [63:48] of \a a.
	/// \returns A 64-bit integer vector containing the shuffled values.			/// \returns A 64-bit integer vector containing the shuffled values.
	#define _mm_shuffle_pi16(a, n) \			#define _mm_shuffle_pi16(a, n) \
	(__m64)__builtin_ia32_pshufw((__v4hi)(__m64)(a), (n))			(__m64)__builtin_shufflevector((__v4hi)(__m64)(a), __extension__ (__v4hi){}, \
				(n) & 0x3, ((n) >> 2) & 0x3, \
				((n) >> 4) & 0x3, ((n) >> 6) & 0x3)
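The new _mm_shuffle_pi16 expansion decodes the immediate into four 2-bit lane indices. A minimal scalar sketch of that decoding follows; shuffle_pi16_model is a hypothetical name and the arrays stand in for the [4 x i16] vector.

```c
#include <stdint.h>

/* Scalar model of the immediate decoding: destination lane i takes the
 * source lane named by bits [2*i+1 : 2*i] of n. Illustration only. */
static void shuffle_pi16_model(const uint16_t a[4], unsigned n,
                               uint16_t out[4])
{
    for (int i = 0; i < 4; ++i)
        out[i] = a[(n >> (2 * i)) & 0x3];
}
```

For example, an immediate of 0x1B reverses the four lanes and 0xE4 leaves them in place.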

	/// Conditionally copies the values from each 8-bit element in the first			/// Conditionally copies the values from each 8-bit element in the first
	/// 64-bit integer vector operand to the specified memory location, as			/// 64-bit integer vector operand to the specified memory location, as
	/// specified by the most significant bit in the corresponding element in the			/// specified by the most significant bit in the corresponding element in the
	/// second 64-bit integer vector operand.			/// second 64-bit integer vector operand.
	///			///
	/// To minimize caching, the data is flagged as non-temporal			/// To minimize caching, the data is flagged as non-temporal
	/// (unlikely to be used again soon).			/// (unlikely to be used again soon).
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> MASKMOVQ </c> instruction.			/// This intrinsic corresponds to the <c> MASKMOVQ </c> instruction.
	///			///
	/// \param __d			/// \param __d
	/// A 64-bit integer vector containing the values with elements to be copied.			/// A 64-bit integer vector containing the values with elements to be copied.
	/// \param __n			/// \param __n
	/// A 64-bit integer vector operand. The most significant bit from each 8-bit			/// A 64-bit integer vector operand. The most significant bit from each 8-bit
	/// element determines whether the corresponding element in operand \a __d			/// element determines whether the corresponding element in operand \a __d
	/// is copied. If the most significant bit of a given element is 1, the			/// is copied. If the most significant bit of a given element is 1, the
	/// corresponding element in operand \a __d is copied.			/// corresponding element in operand \a __d is copied.
	/// \param __p			/// \param __p
	/// A pointer to a 64-bit memory location that will receive the conditionally			/// A pointer to a 64-bit memory location that will receive the conditionally
	/// copied integer values. The address of the memory location does not have			/// copied integer values. The address of the memory location does not have
	/// to be aligned.			/// to be aligned.
	static __inline__ void __DEFAULT_FN_ATTRS_MMX			static __inline__ void __DEFAULT_FN_ATTRS_SSE2
	_mm_maskmove_si64(__m64 __d, __m64 __n, char *__p)			_mm_maskmove_si64(__m64 __d, __m64 __n, char *__p)
	{			{
	__builtin_ia32_maskmovq((__v8qi)__d, (__v8qi)__n, __p);			// This is complex, because we need to support the case where __p is pointing
				// within the last 15 to 8 bytes of a page. In that case, using a 128-bit
				// write might cause a trap where a 64-bit maskmovq would not. (Memory
				// locations not selected by the mask bits might still cause traps.)
				__m128i __d128 = __anyext128(__d);
				__m128i __n128 = __zext128(__n);
				if (((__SIZE_TYPE__)__p & 0xfff) >= 4096-15 &&
				[Inline comment] craig.topper: Does this work with large pages?
				[Inline comment] jyknight (author): Yes -- this needs to be the boundary at which a trap _might_ occur if we crossed it. Whether it's in fact the end of a page or not is irrelevant, only that it _could_ be.
				((__SIZE_TYPE__)__p & 0xfff) <= 4096-8) {
				// If there's a risk of spurious trap due to a 128-bit write, back up the
				// pointer by 8 bytes and shift values in registers to match.
				__p -= 8;
				__d128 = __builtin_ia32_pslldqi128_byteshift((__v2di)__d128, 8);
				__n128 = __builtin_ia32_pslldqi128_byteshift((__v2di)__n128, 8);
				}

				__builtin_ia32_maskmovdqu((__v16qi)__d128, (__v16qi)__n128, __p);
	}			}
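
Reviewer-style note (illustrative, not part of the patch): the address test in `_mm_maskmove_si64` can be read as "would a 16-byte store starting at `__p` cross a 4 KiB boundary that an 8-byte store would not?". The sketch below restates only that arithmetic on integer addresses. The helper names `wide_store_may_fault` and `adjusted_store_address` are hypothetical, and the 4 KiB granule is taken from the `0xfff` mask in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true when the offset within a 4 KiB block lies in
   [4096-15, 4096-8], i.e. an 8-byte store stays inside the block but a
   16-byte store would spill into the next one. */
static bool wide_store_may_fault(uintptr_t p)
{
  uintptr_t off = p & 0xfff;
  return off >= 4096 - 15 && off <= 4096 - 8;
}

/* Hypothetical helper: address at which the 128-bit store is issued.
   When the wide store might fault, back up by 8 bytes; the patch shifts
   the data and mask up by 8 bytes in the register to compensate. */
static uintptr_t adjusted_store_address(uintptr_t p)
{
  if (!wide_store_may_fault(p))
    return p;
  uintptr_t q = p - 8;
  assert((q & 0xfff) <= 4096 - 16); /* 16 bytes now fit in the same block */
  return q;
}
```

Offsets above 4096-8 are not adjusted, since in that range an 8-byte store already crosses the boundary.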

	/// Computes the rounded averages of the packed unsigned 8-bit integer			/// Computes the rounded averages of the packed unsigned 8-bit integer
	/// values and writes the averages to the corresponding bits in the			/// values and writes the averages to the corresponding bits in the
	/// destination.			/// destination.
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> PAVGB </c> instruction.			/// This intrinsic corresponds to the <c> PAVGB </c> instruction.
	///			///
	/// \param __a			/// \param __a
	/// A 64-bit integer vector containing one of the source operands.			/// A 64-bit integer vector containing one of the source operands.
	/// \param __b			/// \param __b
	/// A 64-bit integer vector containing one of the source operands.			/// A 64-bit integer vector containing one of the source operands.
	/// \returns A 64-bit integer vector containing the averages of both operands.			/// \returns A 64-bit integer vector containing the averages of both operands.
	static __inline__ __m64 __DEFAULT_FN_ATTRS_MMX			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_avg_pu8(__m64 __a, __m64 __b)			_mm_avg_pu8(__m64 __a, __m64 __b)
	{			{
	return (__m64)__builtin_ia32_pavgb((__v8qi)__a, (__v8qi)__b);			return __trunc64(__builtin_ia32_pavgb128((__v16qi)__anyext128(__a),
				(__v16qi)__anyext128(__b)));
	}			}

	/// Computes the rounded averages of the packed unsigned 16-bit integer			/// Computes the rounded averages of the packed unsigned 16-bit integer
	/// values and writes the averages to the corresponding bits in the			/// values and writes the averages to the corresponding bits in the
	/// destination.			/// destination.
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> PAVGW </c> instruction.			/// This intrinsic corresponds to the <c> PAVGW </c> instruction.
	///			///
	/// \param __a			/// \param __a
	/// A 64-bit integer vector containing one of the source operands.			/// A 64-bit integer vector containing one of the source operands.
	/// \param __b			/// \param __b
	/// A 64-bit integer vector containing one of the source operands.			/// A 64-bit integer vector containing one of the source operands.
	/// \returns A 64-bit integer vector containing the averages of both operands.			/// \returns A 64-bit integer vector containing the averages of both operands.
	static __inline__ __m64 __DEFAULT_FN_ATTRS_MMX			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_avg_pu16(__m64 __a, __m64 __b)			_mm_avg_pu16(__m64 __a, __m64 __b)
	{			{
	return (__m64)__builtin_ia32_pavgw((__v4hi)__a, (__v4hi)__b);			return __trunc64(__builtin_ia32_pavgw128((__v8hi)__anyext128(__a),
				(__v8hi)__anyext128(__b)));
	}			}
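
Reviewer-style note (illustrative, not part of the patch): the doc comments describe a rounded unsigned average per lane. A scalar sketch of that arithmetic, assuming the usual `(a + b + 1) >> 1` definition of PAVGB/PAVGW computed in a wider type; the function names `avg_pu8_model` and `avg_pu16_model` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of the 8-bit rounded average. The sum is
   formed in a wider unsigned type so the +1 rounding term cannot wrap. */
static void avg_pu8_model(const uint8_t a[8], const uint8_t b[8],
                          uint8_t out[8])
{
  assert(a && b && out);
  for (int i = 0; i < 8; ++i)
    out[i] = (uint8_t)(((unsigned)a[i] + (unsigned)b[i] + 1u) >> 1);
}

/* Same idea for four 16-bit lanes. */
static void avg_pu16_model(const uint16_t a[4], const uint16_t b[4],
                           uint16_t out[4])
{
  assert(a && b && out);
  for (int i = 0; i < 4; ++i)
    out[i] = (uint16_t)(((uint32_t)a[i] + (uint32_t)b[i] + 1u) >> 1);
}
```

Because each lane is computed independently, widening the operands to 128 bits and discarding the upper half leaves the low 64 bits unaffected by whatever the upper half holds.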

	/// Subtracts the corresponding 8-bit unsigned integer values of the two			/// Subtracts the corresponding 8-bit unsigned integer values of the two
	/// 64-bit vector operands and computes the absolute value for each of the			/// 64-bit vector operands and computes the absolute value for each of the
	/// differences. The sum of the 8 absolute differences is written to the			/// differences. The sum of the 8 absolute differences is written to the
	/// bits [15:0] of the destination; the remaining bits [63:16] are cleared.			/// bits [15:0] of the destination; the remaining bits [63:16] are cleared.
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> PSADBW </c> instruction.			/// This intrinsic corresponds to the <c> PSADBW </c> instruction.
	///			///
	/// \param __a			/// \param __a
	/// A 64-bit integer vector containing one of the source operands.			/// A 64-bit integer vector containing one of the source operands.
	/// \param __b			/// \param __b
	/// A 64-bit integer vector containing one of the source operands.			/// A 64-bit integer vector containing one of the source operands.
	/// \returns A 64-bit integer vector whose lower 16 bits contain the sums of the			/// \returns A 64-bit integer vector whose lower 16 bits contain the sums of the
	/// sets of absolute differences between both operands. The upper bits are			/// sets of absolute differences between both operands. The upper bits are
	/// cleared.			/// cleared.
	static __inline__ __m64 __DEFAULT_FN_ATTRS_MMX			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_sad_pu8(__m64 __a, __m64 __b)			_mm_sad_pu8(__m64 __a, __m64 __b)
	{			{
	return (__m64)__builtin_ia32_psadbw((__v8qi)__a, (__v8qi)__b);			return __trunc64(__builtin_ia32_psadbw128((__v16qi)__zext128(__a),
				(__v16qi)__zext128(__b)));
	}			}
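
Reviewer-style note (illustrative, not part of the patch): a scalar sketch of the sum of absolute differences described in the doc comment above. The function name `sad_pu8_model` is hypothetical, and the 64-bit return value stands in for the `__m64` result.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model: sum of |a[i] - b[i]| over eight unsigned
   bytes. The maximum is 8 * 255 = 2040, so the sum fits in bits [15:0]
   and bits [63:16] of the returned value are zero. */
static uint64_t sad_pu8_model(const uint8_t a[8], const uint8_t b[8])
{
  assert(a && b);
  uint64_t sum = 0;
  for (int i = 0; i < 8; ++i)
    sum += (a[i] > b[i]) ? (uint64_t)(a[i] - b[i]) : (uint64_t)(b[i] - a[i]);
  return sum;
}
```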

	#if defined(__cplusplus)			#if defined(__cplusplus)
	extern "C" {			extern "C" {
	#endif			#endif

	/// Returns the contents of the MXCSR register as a 32-bit unsigned			/// Returns the contents of the MXCSR register as a 32-bit unsigned
	/// integer value.			/// integer value.
	[261 unchanged lines not shown]
	///			///
	/// This intrinsic corresponds to the <c> CVTPI2PS + COMPOSITE </c> instruction.			/// This intrinsic corresponds to the <c> CVTPI2PS + COMPOSITE </c> instruction.
	///			///
	/// \param __a			/// \param __a
	/// A 64-bit vector of [4 x i16]. The elements of the destination are copied			/// A 64-bit vector of [4 x i16]. The elements of the destination are copied
	/// from the corresponding elements in this operand.			/// from the corresponding elements in this operand.
	/// \returns A 128-bit vector of [4 x float] containing the copied and converted			/// \returns A 128-bit vector of [4 x float] containing the copied and converted
	/// values from the operand.			/// values from the operand.
	static __inline__ __m128 __DEFAULT_FN_ATTRS_MMX			static __inline__ __m128 __DEFAULT_FN_ATTRS_SSE2
	_mm_cvtpi16_ps(__m64 __a)			_mm_cvtpi16_ps(__m64 __a)
	{			{
	__m64 __b, __c;			return __builtin_convertvector((__v4hi)__a, __v4sf);
	__m128 __r;

	__b = _mm_setzero_si64();
	__b = _mm_cmpgt_pi16(__b, __a);
	__c = _mm_unpackhi_pi16(__a, __b);
	__r = _mm_setzero_ps();
	__r = _mm_cvtpi32_ps(__r, __c);
	__r = _mm_movelh_ps(__r, __r);
	__c = _mm_unpacklo_pi16(__a, __b);
	__r = _mm_cvtpi32_ps(__r, __c);

	return __r;
	}			}

	/// Converts a 64-bit vector of 16-bit unsigned integer values into a			/// Converts a 64-bit vector of 16-bit unsigned integer values into a
	/// 128-bit vector of [4 x float].			/// 128-bit vector of [4 x float].
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> CVTPI2PS + COMPOSITE </c> instruction.			/// This intrinsic corresponds to the <c> CVTPI2PS + COMPOSITE </c> instruction.
	///			///
	/// \param __a			/// \param __a
	/// A 64-bit vector of 16-bit unsigned integer values. The elements of the			/// A 64-bit vector of 16-bit unsigned integer values. The elements of the
	/// destination are copied from the corresponding elements in this operand.			/// destination are copied from the corresponding elements in this operand.
	/// \returns A 128-bit vector of [4 x float] containing the copied and converted			/// \returns A 128-bit vector of [4 x float] containing the copied and converted
	/// values from the operand.			/// values from the operand.
	static __inline__ __m128 __DEFAULT_FN_ATTRS_MMX			static __inline__ __m128 __DEFAULT_FN_ATTRS_SSE2
	_mm_cvtpu16_ps(__m64 __a)			_mm_cvtpu16_ps(__m64 __a)
	{			{
	__m64 __b, __c;			return __builtin_convertvector((__v4hu)__a, __v4sf);
	__m128 __r;

	__b = _mm_setzero_si64();
	__c = _mm_unpackhi_pi16(__a, __b);
	__r = _mm_setzero_ps();
	__r = _mm_cvtpi32_ps(__r, __c);
	__r = _mm_movelh_ps(__r, __r);
	__c = _mm_unpacklo_pi16(__a, __b);
	__r = _mm_cvtpi32_ps(__r, __c);

	return __r;
	}			}

	/// Converts the lower four 8-bit values from a 64-bit vector of [8 x i8]			/// Converts the lower four 8-bit values from a 64-bit vector of [8 x i8]
	/// into a 128-bit vector of [4 x float].			/// into a 128-bit vector of [4 x float].
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> CVTPI2PS + COMPOSITE </c> instruction.			/// This intrinsic corresponds to the <c> CVTPI2PS + COMPOSITE </c> instruction.
	///			///
	/// \param __a			/// \param __a
	/// A 64-bit vector of [8 x i8]. The elements of the destination are copied			/// A 64-bit vector of [8 x i8]. The elements of the destination are copied
	/// from the corresponding lower 4 elements in this operand.			/// from the corresponding lower 4 elements in this operand.
	/// \returns A 128-bit vector of [4 x float] containing the copied and converted			/// \returns A 128-bit vector of [4 x float] containing the copied and converted
	/// values from the operand.			/// values from the operand.
	static __inline__ __m128 __DEFAULT_FN_ATTRS_MMX			static __inline__ __m128 __DEFAULT_FN_ATTRS_SSE2
	_mm_cvtpi8_ps(__m64 __a)			_mm_cvtpi8_ps(__m64 __a)
	{			{
	__m64 __b;			return __builtin_convertvector(
				__builtin_shufflevector((__v8qs)__a, __extension__ (__v8qs){},
	__b = _mm_setzero_si64();			0, 1, 2, 3), __v4sf);
	__b = _mm_cmpgt_pi8(__b, __a);
	__b = _mm_unpacklo_pi8(__a, __b);

	return _mm_cvtpi16_ps(__b);
	}			}

	/// Converts the lower four unsigned 8-bit integer values from a 64-bit			/// Converts the lower four unsigned 8-bit integer values from a 64-bit
	/// vector of [8 x u8] into a 128-bit vector of [4 x float].			/// vector of [8 x u8] into a 128-bit vector of [4 x float].
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> CVTPI2PS + COMPOSITE </c> instruction.			/// This intrinsic corresponds to the <c> CVTPI2PS + COMPOSITE </c> instruction.
	///			///
	/// \param __a			/// \param __a
	/// A 64-bit vector of unsigned 8-bit integer values. The elements of the			/// A 64-bit vector of unsigned 8-bit integer values. The elements of the
	/// destination are copied from the corresponding lower 4 elements in this			/// destination are copied from the corresponding lower 4 elements in this
	/// operand.			/// operand.
	/// \returns A 128-bit vector of [4 x float] containing the copied and converted			/// \returns A 128-bit vector of [4 x float] containing the copied and converted
	/// values from the source operand.			/// values from the source operand.
	static __inline__ __m128 __DEFAULT_FN_ATTRS_MMX			static __inline__ __m128 __DEFAULT_FN_ATTRS_SSE2
	_mm_cvtpu8_ps(__m64 __a)			_mm_cvtpu8_ps(__m64 __a)
	{			{
	__m64 __b;			return __builtin_convertvector(
				__builtin_shufflevector((__v8qu)__a, __extension__ (__v8qu){},
	__b = _mm_setzero_si64();			0, 1, 2, 3), __v4sf);
	__b = _mm_unpacklo_pi8(__a, __b);

	return _mm_cvtpi16_ps(__b);
	}			}
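
Reviewer-style note (illustrative, not part of the patch): `_mm_cvtpi8_ps` and `_mm_cvtpu8_ps` differ only in whether the low four bytes are read as signed or unsigned before conversion, which is why the new code casts to `__v8qs` and `__v8qu` respectively. A scalar sketch under that reading; the function names are hypothetical and fixed-width types replace the vector typedefs.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of the signed variant: the low four bytes are
   sign-extended, then converted to float. */
static void cvtpi8_ps_model(const int8_t a[8], float out[4])
{
  assert(a && out);
  for (int i = 0; i < 4; ++i)
    out[i] = (float)a[i];
}

/* Hypothetical scalar model of the unsigned variant: the same bytes are
   zero-extended instead. */
static void cvtpu8_ps_model(const uint8_t a[8], float out[4])
{
  assert(a && out);
  for (int i = 0; i < 4; ++i)
    out[i] = (float)a[i];
}
```

The byte 0xFF converts to -1.0f in the signed model and 255.0f in the unsigned one, so a plain `char` element type would make the result depend on `-fno-signed-char`.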

	/// Converts the two 32-bit signed integer values from each 64-bit vector			/// Converts the two 32-bit signed integer values from each 64-bit vector
	/// operand of [2 x i32] into a 128-bit vector of [4 x float].			/// operand of [2 x i32] into a 128-bit vector of [4 x float].
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> CVTPI2PS + COMPOSITE </c> instruction.			/// This intrinsic corresponds to the <c> CVTPI2PS + COMPOSITE </c> instruction.
	///			///
	/// \param __a			/// \param __a
	/// A 64-bit vector of [2 x i32]. The lower elements of the destination are			/// A 64-bit vector of [2 x i32]. The lower elements of the destination are
	/// copied from the elements in this operand.			/// copied from the elements in this operand.
	/// \param __b			/// \param __b
	/// A 64-bit vector of [2 x i32]. The upper elements of the destination are			/// A 64-bit vector of [2 x i32]. The upper elements of the destination are
	/// copied from the elements in this operand.			/// copied from the elements in this operand.
	/// \returns A 128-bit vector of [4 x float] whose lower 64 bits contain the			/// \returns A 128-bit vector of [4 x float] whose lower 64 bits contain the
	/// copied and converted values from the first operand. The upper 64 bits			/// copied and converted values from the first operand. The upper 64 bits
	/// contain the copied and converted values from the second operand.			/// contain the copied and converted values from the second operand.
	static __inline__ __m128 __DEFAULT_FN_ATTRS_MMX			static __inline__ __m128 __DEFAULT_FN_ATTRS_SSE2
	_mm_cvtpi32x2_ps(__m64 __a, __m64 __b)			_mm_cvtpi32x2_ps(__m64 __a, __m64 __b)
	{			{
	__m128 __c;			return __builtin_convertvector(
				__builtin_shufflevector((__v2si)__a, (__v2si)__b,
	__c = _mm_setzero_ps();			0, 1, 2, 3), __v4sf);
	__c = _mm_cvtpi32_ps(__c, __b);
	__c = _mm_movelh_ps(__c, __c);

	return _mm_cvtpi32_ps(__c, __a);
	}			}

	/// Converts each single-precision floating-point element of a 128-bit			/// Converts each single-precision floating-point element of a 128-bit
	/// floating-point vector of [4 x float] into a 16-bit signed integer, and			/// floating-point vector of [4 x float] into a 16-bit signed integer, and
	/// packs the results into a 64-bit integer vector of [4 x i16].			/// packs the results into a 64-bit integer vector of [4 x i16].
	///			///
	/// If the floating-point element is NaN or infinity, or if the			/// If the floating-point element is NaN or infinity, or if the
	/// floating-point element is greater than 0x7FFFFFFF or less than -0x8000,			/// floating-point element is greater than 0x7FFFFFFF or less than -0x8000,
	/// it is converted to 0x8000. Otherwise if the floating-point element is			/// it is converted to 0x8000. Otherwise if the floating-point element is
	/// greater than 0x7FFF, it is converted to 0x7FFF.			/// greater than 0x7FFF, it is converted to 0x7FFF.
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> CVTPS2PI + COMPOSITE </c> instruction.			/// This intrinsic corresponds to the <c> CVTPS2PI + COMPOSITE </c> instruction.
	///			///
	/// \param __a			/// \param __a
	/// A 128-bit floating-point vector of [4 x float].			/// A 128-bit floating-point vector of [4 x float].
	/// \returns A 64-bit integer vector of [4 x i16] containing the converted			/// \returns A 64-bit integer vector of [4 x i16] containing the converted
	/// values.			/// values.
	static __inline__ __m64 __DEFAULT_FN_ATTRS_MMX			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_cvtps_pi16(__m128 __a)			_mm_cvtps_pi16(__m128 __a)
	{			{
	__m64 __b, __c;			return __trunc64(__builtin_ia32_packssdw128(
				(__v4si)__builtin_ia32_cvtps2dq((__v4sf)__a), (__v4si)_mm_setzero_ps()));
	__b = _mm_cvtps_pi32(__a);
	__a = _mm_movehl_ps(__a, __a);
	__c = _mm_cvtps_pi32(__a);

	return _mm_packs_pi32(__b, __c);
	}			}
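
Reviewer-style note (illustrative, not part of the patch): the new `_mm_cvtps_pi16` body converts to 32-bit integers and then packs with signed saturation. The sketch below models only the packing step on already-converted 32-bit values; the float-to-integer rounding is deliberately left out, and the names `saturate_i32_to_i16` and `packs_low4_model` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: clamp a 32-bit signed value to the int16_t range,
   as a signed-saturating pack does per lane. */
static int16_t saturate_i32_to_i16(int32_t v)
{
  if (v > INT16_MAX)
    return INT16_MAX;
  if (v < INT16_MIN)
    return INT16_MIN;
  return (int16_t)v;
}

/* Hypothetical model of the low four output lanes of the pack. */
static void packs_low4_model(const int32_t in[4], int16_t out[4])
{
  assert(in && out);
  for (int i = 0; i < 4; ++i)
    out[i] = saturate_i32_to_i16(in[i]);
}
```

Under this model the 32-bit value 0x80000000 packs to 0x8000, which is consistent with the out-of-range result stated in the doc comment.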

	/// Converts each single-precision floating-point element of a 128-bit			/// Converts each single-precision floating-point element of a 128-bit
	/// floating-point vector of [4 x float] into an 8-bit signed integer, and			/// floating-point vector of [4 x float] into an 8-bit signed integer, and
	/// packs the results into the lower 32 bits of a 64-bit integer vector of			/// packs the results into the lower 32 bits of a 64-bit integer vector of
	/// [8 x i8]. The upper 32 bits of the vector are set to 0.			/// [8 x i8]. The upper 32 bits of the vector are set to 0.
	///			///
	/// If the floating-point element is NaN or infinity, or if the			/// If the floating-point element is NaN or infinity, or if the
	/// floating-point element is greater than 0x7FFFFFFF or less than -0x80, it			/// floating-point element is greater than 0x7FFFFFFF or less than -0x80, it
	/// is converted to 0x80. Otherwise if the floating-point element is greater			/// is converted to 0x80. Otherwise if the floating-point element is greater
	/// than 0x7F, it is converted to 0x7F.			/// than 0x7F, it is converted to 0x7F.
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> CVTPS2PI + COMPOSITE </c> instruction.			/// This intrinsic corresponds to the <c> CVTPS2PI + COMPOSITE </c> instruction.
	///			///
	/// \param __a			/// \param __a
	/// 128-bit floating-point vector of [4 x float].			/// 128-bit floating-point vector of [4 x float].
	/// \returns A 64-bit integer vector of [8 x i8]. The lower 32 bits contain the			/// \returns A 64-bit integer vector of [8 x i8]. The lower 32 bits contain the
	/// converted values and the upper 32 bits are set to zero.			/// converted values and the upper 32 bits are set to zero.
	static __inline__ __m64 __DEFAULT_FN_ATTRS_MMX			static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
	_mm_cvtps_pi8(__m128 __a)			_mm_cvtps_pi8(__m128 __a)
	{			{
	__m64 __b, __c;			__m64 __b, __c;

	__b = _mm_cvtps_pi16(__a);			__b = _mm_cvtps_pi16(__a);
	__c = _mm_setzero_si64();			__c = _mm_setzero_si64();

	return _mm_packs_pi16(__b, __c);			return _mm_packs_pi16(__b, __c);
	[85 unchanged lines not shown]
	#define _m_pshufw _mm_shuffle_pi16			#define _m_pshufw _mm_shuffle_pi16
	#define _m_maskmovq _mm_maskmove_si64			#define _m_maskmovq _mm_maskmove_si64
	#define _m_pavgb _mm_avg_pu8			#define _m_pavgb _mm_avg_pu8
	#define _m_pavgw _mm_avg_pu16			#define _m_pavgw _mm_avg_pu16
	#define _m_psadbw _mm_sad_pu8			#define _m_psadbw _mm_sad_pu8
	#define _m_ _mm_			#define _m_ _mm_
	#define _m_ _mm_			#define _m_ _mm_

				#undef __trunc64
				#undef __zext128
				#undef __anyext128
				#undef __zeroupper64
	#undef __DEFAULT_FN_ATTRS			#undef __DEFAULT_FN_ATTRS
	#undef __DEFAULT_FN_ATTRS_MMX			#undef __DEFAULT_FN_ATTRS_SSE2

	/* Ugly hack for backwards-compatibility (compatible with gcc) */			/* Ugly hack for backwards-compatibility (compatible with gcc) */
	#if defined(__SSE2__) && !__building_module(_Builtin_intrinsics)			#if defined(__SSE2__) && !__building_module(_Builtin_intrinsics)
	#include <emmintrin.h>			#include <emmintrin.h>
	#endif			#endif

	#endif /* __XMMINTRIN_H */			#endif /* __XMMINTRIN_H */

clang/test/CodeGen/X86/mmx-builtins.c

// RUN: %clang_cc1 -flax-vector-conversions=none -ffreestanding %s -triple=x86_64-apple-darwin -target-feature +ssse3 -emit-llvm -o - -Wall -Werror | FileCheck %s		// RUN: %clang_cc1 -flax-vector-conversions=none -ffreestanding %s -triple=x86_64-apple-darwin -target-feature +ssse3 -emit-llvm -o - -Wall -Werror | FileCheck %s --implicit-check-not=x86mmx
// RUN: %clang_cc1 -flax-vector-conversions=none -ffreestanding %s -triple=x86_64-apple-darwin -target-feature +ssse3 -fno-signed-char -emit-llvm -o - -Wall -Werror | FileCheck %s		// RUN: %clang_cc1 -flax-vector-conversions=none -ffreestanding %s -triple=x86_64-apple-darwin -target-feature +ssse3 -fno-signed-char -emit-llvm -o - -Wall -Werror | FileCheck %s --implicit-check-not=x86mmx


#include <immintrin.h>		#include <immintrin.h>

__m64 test_mm_abs_pi8(__m64 a) {		__m64 test_mm_abs_pi8(__m64 a) {
// CHECK-LABEL: test_mm_abs_pi8		// CHECK-LABEL: test_mm_abs_pi8
// CHECK: call x86_mmx @llvm.x86.ssse3.pabs.b		// CHECK: call <16 x i8> @llvm.abs.v16i8(
return _mm_abs_pi8(a);		return _mm_abs_pi8(a);
}		}

__m64 test_mm_abs_pi16(__m64 a) {		__m64 test_mm_abs_pi16(__m64 a) {
// CHECK-LABEL: test_mm_abs_pi16		// CHECK-LABEL: test_mm_abs_pi16
// CHECK: call x86_mmx @llvm.x86.ssse3.pabs.w		// CHECK: call <8 x i16> @llvm.abs.v8i16(
return _mm_abs_pi16(a);		return _mm_abs_pi16(a);
}		}

__m64 test_mm_abs_pi32(__m64 a) {		__m64 test_mm_abs_pi32(__m64 a) {
// CHECK-LABEL: test_mm_abs_pi32		// CHECK-LABEL: test_mm_abs_pi32
// CHECK: call x86_mmx @llvm.x86.ssse3.pabs.d		// CHECK: call <4 x i32> @llvm.abs.v4i32(
return _mm_abs_pi32(a);		return _mm_abs_pi32(a);
}		}

__m64 test_mm_add_pi8(__m64 a, __m64 b) {		__m64 test_mm_add_pi8(__m64 a, __m64 b) {
// CHECK-LABEL: test_mm_add_pi8		// CHECK-LABEL: test_mm_add_pi8
// CHECK: call x86_mmx @llvm.x86.mmx.padd.b		// CHECK: add <8 x i8> {{%.*}}, {{%.*}}
return _mm_add_pi8(a, b);		return _mm_add_pi8(a, b);
}		}

__m64 test_mm_add_pi16(__m64 a, __m64 b) {		__m64 test_mm_add_pi16(__m64 a, __m64 b) {
// CHECK-LABEL: test_mm_add_pi16		// CHECK-LABEL: test_mm_add_pi16
// CHECK: call x86_mmx @llvm.x86.mmx.padd.w		// CHECK: add <4 x i16> {{%.*}}, {{%.*}}
return _mm_add_pi16(a, b);		return _mm_add_pi16(a, b);
}		}

__m64 test_mm_add_pi32(__m64 a, __m64 b) {		__m64 test_mm_add_pi32(__m64 a, __m64 b) {
// CHECK-LABEL: test_mm_add_pi32		// CHECK-LABEL: test_mm_add_pi32
// CHECK: call x86_mmx @llvm.x86.mmx.padd.d		// CHECK: add <2 x i32> {{%.*}}, {{%.*}}
return _mm_add_pi32(a, b);		return _mm_add_pi32(a, b);
}		}

__m64 test_mm_add_si64(__m64 a, __m64 b) {		__m64 test_mm_add_si64(__m64 a, __m64 b) {
// CHECK-LABEL: test_mm_add_si64		// CHECK-LABEL: test_mm_add_si64
// CHECK: call x86_mmx @llvm.x86.mmx.padd.q(x86_mmx %{{.*}}, x86_mmx %{{.*}})		// CHECK: add i64 {{%.*}}, {{%.*}}
return _mm_add_si64(a, b);		return _mm_add_si64(a, b);
}		}

__m64 test_mm_adds_pi8(__m64 a, __m64 b) {		__m64 test_mm_adds_pi8(__m64 a, __m64 b) {
// CHECK-LABEL: test_mm_adds_pi8		// CHECK-LABEL: test_mm_adds_pi8
// CHECK: call x86_mmx @llvm.x86.mmx.padds.b		// CHECK: call <16 x i8> @llvm.sadd.sat.v16i8(
return _mm_adds_pi8(a, b);		return _mm_adds_pi8(a, b);
}		}

__m64 test_mm_adds_pi16(__m64 a, __m64 b) {		__m64 test_mm_adds_pi16(__m64 a, __m64 b) {
// CHECK-LABEL: test_mm_adds_pi16		// CHECK-LABEL: test_mm_adds_pi16
// CHECK: call x86_mmx @llvm.x86.mmx.padds.w		// CHECK: call <8 x i16> @llvm.sadd.sat.v8i16(
return _mm_adds_pi16(a, b);		return _mm_adds_pi16(a, b);
}		}

__m64 test_mm_adds_pu8(__m64 a, __m64 b) {		__m64 test_mm_adds_pu8(__m64 a, __m64 b) {
// CHECK-LABEL: test_mm_adds_pu8		// CHECK-LABEL: test_mm_adds_pu8
// CHECK: call x86_mmx @llvm.x86.mmx.paddus.b		// CHECK: call <16 x i8> @llvm.uadd.sat.v16i8(
return _mm_adds_pu8(a, b);		return _mm_adds_pu8(a, b);
}		}

__m64 test_mm_adds_pu16(__m64 a, __m64 b) {		__m64 test_mm_adds_pu16(__m64 a, __m64 b) {
// CHECK-LABEL: test_mm_adds_pu16		// CHECK-LABEL: test_mm_adds_pu16
// CHECK: call x86_mmx @llvm.x86.mmx.paddus.w		// CHECK: call <8 x i16> @llvm.uadd.sat.v8i16(
return _mm_adds_pu16(a, b);		return _mm_adds_pu16(a, b);
}		}

__m64 test_mm_alignr_pi8(__m64 a, __m64 b) {		__m64 test_mm_alignr_pi8(__m64 a, __m64 b) {
// CHECK-LABEL: test_mm_alignr_pi8		// CHECK-LABEL: test_mm_alignr_pi8
// CHECK: call x86_mmx @llvm.x86.mmx.palignr.b		// CHECK: shufflevector <16 x i8> {{%.*}}, <16 x i8> zeroinitializer, <16 x i32> <i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17>
return _mm_alignr_pi8(a, b, 2);		return _mm_alignr_pi8(a, b, 2);
}		}
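
Reviewer-style note (illustrative, not part of the patch): the expected shuffle mask above (indices 2 through 17, with a zero vector as the second operand) matches a byte-wise right shift of a 16-byte concatenation. A scalar sketch of that behaviour, assuming `a` forms the high half and `b` the low half; the function name `alignr_pi8_model` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar model: concatenate a (high 8 bytes) and b (low 8
   bytes), shift right by n bytes, and keep the low 8 bytes. Bytes shifted
   in from beyond the 16-byte value are zero. */
static void alignr_pi8_model(const uint8_t a[8], const uint8_t b[8],
                             unsigned n, uint8_t out[8])
{
  uint8_t cat[16];
  assert(a && b && out);
  memcpy(cat, b, 8);     /* low half  */
  memcpy(cat + 8, a, 8); /* high half */
  for (unsigned i = 0; i < 8; ++i)
    out[i] = (n < 16 && i + n < 16) ? cat[i + n] : 0;
}
```

For the shift of 2 used in `test_mm_alignr_pi8`, the result is bytes 2 through 7 of `b` followed by bytes 0 and 1 of `a`.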

__m64 test_mm_and_si64(__m64 a, __m64 b) {		__m64 test_mm_and_si64(__m64 a, __m64 b) {
// CHECK-LABEL: test_mm_and_si64		// CHECK-LABEL: test_mm_and_si64
// CHECK: call x86_mmx @llvm.x86.mmx.pand		// CHECK: and <1 x i64> {{%.*}}, {{%.*}}
return _mm_and_si64(a, b);		return _mm_and_si64(a, b);
}		}

__m64 test_mm_andnot_si64(__m64 a, __m64 b) {		__m64 test_mm_andnot_si64(__m64 a, __m64 b) {
// CHECK-LABEL: test_mm_andnot_si64		// CHECK-LABEL: test_mm_andnot_si64
// CHECK: call x86_mmx @llvm.x86.mmx.pandn		// CHECK: [[TMP:%.*]] = xor <1 x i64> {{%.*}}, <i64 -1>
		// CHECK: and <1 x i64> [[TMP]], {{%.*}}
return _mm_andnot_si64(a, b);		return _mm_andnot_si64(a, b);
}		}

__m64 test_mm_avg_pu8(__m64 a, __m64 b) {		__m64 test_mm_avg_pu8(__m64 a, __m64 b) {
// CHECK-LABEL: test_mm_avg_pu8		// CHECK-LABEL: test_mm_avg_pu8
// CHECK: call x86_mmx @llvm.x86.mmx.pavg.b		// CHECK: call <16 x i8> @llvm.x86.sse2.pavg.b(
return _mm_avg_pu8(a, b);		return _mm_avg_pu8(a, b);
}		}

__m64 test_mm_avg_pu16(__m64 a, __m64 b) {		__m64 test_mm_avg_pu16(__m64 a, __m64 b) {
// CHECK-LABEL: test_mm_avg_pu16		// CHECK-LABEL: test_mm_avg_pu16
// CHECK: call x86_mmx @llvm.x86.mmx.pavg.w		// CHECK: call <8 x i16> @llvm.x86.sse2.pavg.w(
return _mm_avg_pu16(a, b);		return _mm_avg_pu16(a, b);
}		}

__m64 test_mm_cmpeq_pi8(__m64 a, __m64 b) {		__m64 test_mm_cmpeq_pi8(__m64 a, __m64 b) {
// CHECK-LABEL: test_mm_cmpeq_pi8		// CHECK-LABEL: test_mm_cmpeq_pi8
// CHECK: call x86_mmx @llvm.x86.mmx.pcmpeq.b		// CHECK: [[CMP:%.*]] = icmp eq <8 x i8> {{%.*}}, {{%.*}}
		// CHECK-NEXT: {{%.*}} = sext <8 x i1> [[CMP]] to <8 x i8>
return _mm_cmpeq_pi8(a, b);		return _mm_cmpeq_pi8(a, b);
}		}
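
Reviewer-style note (illustrative, not part of the patch): the new expected IR for the compare intrinsics is an `icmp` followed by a `sext` of the `i1` result, which yields an all-ones lane where the comparison holds and zero elsewhere. A scalar sketch of that mask convention; the function names `cmpeq_pi8_model` and `cmpgt_pi8_model` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of the equality compare: each lane becomes
   -1 (all bits set) when equal, 0 otherwise, mirroring sext(i1). */
static void cmpeq_pi8_model(const int8_t a[8], const int8_t b[8],
                            int8_t out[8])
{
  assert(a && b && out);
  for (int i = 0; i < 8; ++i)
    out[i] = (a[i] == b[i]) ? (int8_t)-1 : (int8_t)0;
}

/* Same mask convention for the signed greater-than compare. */
static void cmpgt_pi8_model(const int8_t a[8], const int8_t b[8],
                            int8_t out[8])
{
  assert(a && b && out);
  for (int i = 0; i < 8; ++i)
    out[i] = (a[i] > b[i]) ? (int8_t)-1 : (int8_t)0;
}
```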

__m64 test_mm_cmpeq_pi16(__m64 a, __m64 b) {		__m64 test_mm_cmpeq_pi16(__m64 a, __m64 b) {
// CHECK-LABEL: test_mm_cmpeq_pi16		// CHECK-LABEL: test_mm_cmpeq_pi16
// CHECK: call x86_mmx @llvm.x86.mmx.pcmpeq.w		// CHECK: [[CMP:%.*]] = icmp eq <4 x i16> {{%.*}}, {{%.*}}
		// CHECK-NEXT: {{%.*}} = sext <4 x i1> [[CMP]] to <4 x i16>
return _mm_cmpeq_pi16(a, b);		return _mm_cmpeq_pi16(a, b);
}		}

__m64 test_mm_cmpeq_pi32(__m64 a, __m64 b) {		__m64 test_mm_cmpeq_pi32(__m64 a, __m64 b) {
// CHECK-LABEL: test_mm_cmpeq_pi32		// CHECK-LABEL: test_mm_cmpeq_pi32
// CHECK: call x86_mmx @llvm.x86.mmx.pcmpeq.d		// CHECK: [[CMP:%.*]] = icmp eq <2 x i32> {{%.*}}, {{%.*}}
		// CHECK-NEXT: {{%.*}} = sext <2 x i1> [[CMP]] to <2 x i32>
return _mm_cmpeq_pi32(a, b);		return _mm_cmpeq_pi32(a, b);
}		}

__m64 test_mm_cmpgt_pi8(__m64 a, __m64 b) {		__m64 test_mm_cmpgt_pi8(__m64 a, __m64 b) {
// CHECK-LABEL: test_mm_cmpgt_pi8		// CHECK-LABEL: test_mm_cmpgt_pi8
// CHECK: call x86_mmx @llvm.x86.mmx.pcmpgt.b		// CHECK: [[CMP:%.*]] = icmp sgt <8 x i8> {{%.*}}, {{%.*}}
		// CHECK-NEXT: {{%.*}} = sext <8 x i1> [[CMP]] to <8 x i8>
return _mm_cmpgt_pi8(a, b);		return _mm_cmpgt_pi8(a, b);
}		}

__m64 test_mm_cmpgt_pi16(__m64 a, __m64 b) {		__m64 test_mm_cmpgt_pi16(__m64 a, __m64 b) {
// CHECK-LABEL: test_mm_cmpgt_pi16		// CHECK-LABEL: test_mm_cmpgt_pi16
// CHECK: call x86_mmx @llvm.x86.mmx.pcmpgt.w		// CHECK: [[CMP:%.*]] = icmp sgt <4 x i16> {{%.*}}, {{%.*}}
		// CHECK-NEXT: {{%.*}} = sext <4 x i1> [[CMP]] to <4 x i16>
return _mm_cmpgt_pi16(a, b);		return _mm_cmpgt_pi16(a, b);
}		}

__m64 test_mm_cmpgt_pi32(__m64 a, __m64 b) {		__m64 test_mm_cmpgt_pi32(__m64 a, __m64 b) {
// CHECK-LABEL: test_mm_cmpgt_pi32		// CHECK-LABEL: test_mm_cmpgt_pi32
// CHECK: call x86_mmx @llvm.x86.mmx.pcmpgt.d		// CHECK: [[CMP:%.*]] = icmp sgt <2 x i32> {{%.*}}, {{%.*}}
		// CHECK-NEXT: {{%.*}} = sext <2 x i1> [[CMP]] to <2 x i32>
return _mm_cmpgt_pi32(a, b);		return _mm_cmpgt_pi32(a, b);
}		}

__m128 test_mm_cvt_pi2ps(__m128 a, __m64 b) {		__m128 test_mm_cvt_pi2ps(__m128 a, __m64 b) {
// CHECK-LABEL: test_mm_cvt_pi2ps		// CHECK-LABEL: test_mm_cvt_pi2ps
// CHECK: <4 x float> @llvm.x86.sse.cvtpi2ps		// CHECK: sitofp <4 x i32> {{%.*}} to <4 x float>
return _mm_cvt_pi2ps(a, b);		return _mm_cvt_pi2ps(a, b);
}		}

__m64 test_mm_cvt_ps2pi(__m128 a) {		__m64 test_mm_cvt_ps2pi(__m128 a) {
// CHECK-LABEL: test_mm_cvt_ps2pi		// CHECK-LABEL: test_mm_cvt_ps2pi
// CHECK: call x86_mmx @llvm.x86.sse.cvtps2pi		// CHECK: call <4 x i32> @llvm.x86.sse2.cvtps2dq(
return _mm_cvt_ps2pi(a);		return _mm_cvt_ps2pi(a);
}		}

__m64 test_mm_cvtpd_pi32(__m128d a) {		__m64 test_mm_cvtpd_pi32(__m128d a) {
// CHECK-LABEL: test_mm_cvtpd_pi32		// CHECK-LABEL: test_mm_cvtpd_pi32
// CHECK: call x86_mmx @llvm.x86.sse.cvtpd2pi		// CHECK: call <4 x i32> @llvm.x86.sse2.cvtpd2dq(
return _mm_cvtpd_pi32(a);		return _mm_cvtpd_pi32(a);
}		}

__m128 test_mm_cvtpi16_ps(__m64 a) {		__m128 test_mm_cvtpi16_ps(__m64 a) {
// CHECK-LABEL: test_mm_cvtpi16_ps		// CHECK-LABEL: test_mm_cvtpi16_ps
// CHECK: call <4 x float> @llvm.x86.sse.cvtpi2ps		// CHECK: sitofp <4 x i16> {{%.*}} to <4 x float>
return _mm_cvtpi16_ps(a);		return _mm_cvtpi16_ps(a);
}		}

__m128d test_mm_cvtpi32_pd(__m64 a) {		__m128d test_mm_cvtpi32_pd(__m64 a) {
// CHECK-LABEL: test_mm_cvtpi32_pd		// CHECK-LABEL: test_mm_cvtpi32_pd
// CHECK: call <2 x double> @llvm.x86.sse.cvtpi2pd		// CHECK: sitofp <2 x i32> {{%.*}} to <2 x double>
return _mm_cvtpi32_pd(a);		return _mm_cvtpi32_pd(a);
}		}

__m128 test_mm_cvtpi32_ps(__m128 a, __m64 b) {		__m128 test_mm_cvtpi32_ps(__m128 a, __m64 b) {
// CHECK-LABEL: test_mm_cvtpi32_ps		// CHECK-LABEL: test_mm_cvtpi32_ps
// CHECK: call <4 x float> @llvm.x86.sse.cvtpi2ps		// CHECK: sitofp <4 x i32> {{%.*}} to <4 x float>
return _mm_cvtpi32_ps(a, b);		return _mm_cvtpi32_ps(a, b);
}		}

__m128 test_mm_cvtpi32x2_ps(__m64 a, __m64 b) {		__m128 test_mm_cvtpi32x2_ps(__m64 a, __m64 b) {
// CHECK-LABEL: test_mm_cvtpi32x2_ps		// CHECK-LABEL: test_mm_cvtpi32x2_ps
// CHECK: call <4 x float> @llvm.x86.sse.cvtpi2ps		// CHECK: sitofp <4 x i32> {{%.*}} to <4 x float>
// CHECK: call <4 x float> @llvm.x86.sse.cvtpi2ps
return _mm_cvtpi32x2_ps(a, b);		return _mm_cvtpi32x2_ps(a, b);
}		}

__m64 test_mm_cvtps_pi16(__m128 a) {		__m64 test_mm_cvtps_pi16(__m128 a) {
// CHECK-LABEL: test_mm_cvtps_pi16		// CHECK-LABEL: test_mm_cvtps_pi16
// CHECK: call x86_mmx @llvm.x86.sse.cvtps2pi		// CHECK: [[TMP0:%.*]] = call <4 x i32> @llvm.x86.sse2.cvtps2dq(<4 x float> {{%.*}})
		// CHECK: call <8 x i16> @llvm.x86.sse2.packssdw.128(<4 x i32> [[TMP0]],
return _mm_cvtps_pi16(a);		return _mm_cvtps_pi16(a);
}		}

__m64 test_mm_cvtps_pi32(__m128 a) {		__m64 test_mm_cvtps_pi32(__m128 a) {
// CHECK-LABEL: test_mm_cvtps_pi32		// CHECK-LABEL: test_mm_cvtps_pi32
// CHECK: call x86_mmx @llvm.x86.sse.cvtps2pi		// CHECK: call <4 x i32> @llvm.x86.sse2.cvtps2dq(
return _mm_cvtps_pi32(a);		return _mm_cvtps_pi32(a);
}		}

__m64 test_mm_cvtsi32_si64(int a) {		__m64 test_mm_cvtsi32_si64(int a) {
// CHECK-LABEL: test_mm_cvtsi32_si64		// CHECK-LABEL: test_mm_cvtsi32_si64
// CHECK: insertelement <2 x i32>		// CHECK: insertelement <2 x i32>
return _mm_cvtsi32_si64(a);		return _mm_cvtsi32_si64(a);
}		}

int test_mm_cvtsi64_si32(__m64 a) {		int test_mm_cvtsi64_si32(__m64 a) {
// CHECK-LABEL: test_mm_cvtsi64_si32		// CHECK-LABEL: test_mm_cvtsi64_si32
// CHECK: extractelement <2 x i32>		// CHECK: extractelement <2 x i32>
return _mm_cvtsi64_si32(a);		return _mm_cvtsi64_si32(a);
}		}

__m64 test_mm_cvttpd_pi32(__m128d a) {		__m64 test_mm_cvttpd_pi32(__m128d a) {
// CHECK-LABEL: test_mm_cvttpd_pi32		// CHECK-LABEL: test_mm_cvttpd_pi32
// CHECK: call x86_mmx @llvm.x86.sse.cvttpd2pi		// CHECK: call <4 x i32> @llvm.x86.sse2.cvttpd2dq(
return _mm_cvttpd_pi32(a);		return _mm_cvttpd_pi32(a);
}		}

__m64 test_mm_cvttps_pi32(__m128 a) {		__m64 test_mm_cvttps_pi32(__m128 a) {
// CHECK-LABEL: test_mm_cvttps_pi32		// CHECK-LABEL: test_mm_cvttps_pi32
// CHECK: call x86_mmx @llvm.x86.sse.cvttps2pi		// CHECK: call <4 x i32> @llvm.x86.sse2.cvttps2dq(
return _mm_cvttps_pi32(a);		return _mm_cvttps_pi32(a);
}		}

int test_mm_extract_pi16(__m64 a) {		int test_mm_extract_pi16(__m64 a) {
// CHECK-LABEL: test_mm_extract_pi16		// CHECK-LABEL: test_mm_extract_pi16
// CHECK: call i32 @llvm.x86.mmx.pextr.w		// CHECK: extractelement <4 x i16> {{%.*}}, i64 2
return _mm_extract_pi16(a, 2);		return _mm_extract_pi16(a, 2);
}		}

__m64 test_m_from_int(int a) {		__m64 test_m_from_int(int a) {
// CHECK-LABEL: test_m_from_int		// CHECK-LABEL: test_m_from_int
// CHECK: insertelement <2 x i32>		// CHECK: insertelement <2 x i32>
return _m_from_int(a);		return _m_from_int(a);
}		}

__m64 test_m_from_int64(long long a) {		__m64 test_m_from_int64(long long a) {
// CHECK-LABEL: test_m_from_int64		// CHECK-LABEL: test_m_from_int64
// CHECK: bitcast		// CHECK: bitcast
return _m_from_int64(a);		return _m_from_int64(a);
}		}

__m64 test_mm_hadd_pi16(__m64 a, __m64 b) {		__m64 test_mm_hadd_pi16(__m64 a, __m64 b) {
// CHECK-LABEL: test_mm_hadd_pi16		// CHECK-LABEL: test_mm_hadd_pi16
// CHECK: call x86_mmx @llvm.x86.ssse3.phadd.w		// CHECK: call <8 x i16> @llvm.x86.ssse3.phadd.w.128(
return _mm_hadd_pi16(a, b);		return _mm_hadd_pi16(a, b);
}		}

__m64 test_mm_hadd_pi32(__m64 a, __m64 b) {		__m64 test_mm_hadd_pi32(__m64 a, __m64 b) {
// CHECK-LABEL: test_mm_hadd_pi32		// CHECK-LABEL: test_mm_hadd_pi32
// CHECK: call x86_mmx @llvm.x86.ssse3.phadd.d		// CHECK: call <4 x i32> @llvm.x86.ssse3.phadd.d.128(
return _mm_hadd_pi32(a, b);		return _mm_hadd_pi32(a, b);
}		}

__m64 test_mm_hadds_pi16(__m64 a, __m64 b) {		__m64 test_mm_hadds_pi16(__m64 a, __m64 b) {
// CHECK-LABEL: test_mm_hadds_pi16		// CHECK-LABEL: test_mm_hadds_pi16
// CHECK: call x86_mmx @llvm.x86.ssse3.phadd.sw		// CHECK: call <8 x i16> @llvm.x86.ssse3.phadd.sw.128(
return _mm_hadds_pi16(a, b);		return _mm_hadds_pi16(a, b);
}		}

__m64 test_mm_hsub_pi16(__m64 a, __m64 b) {		__m64 test_mm_hsub_pi16(__m64 a, __m64 b) {
// CHECK-LABEL: test_mm_hsub_pi16		// CHECK-LABEL: test_mm_hsub_pi16
// CHECK: call x86_mmx @llvm.x86.ssse3.phsub.w		// CHECK: call <8 x i16> @llvm.x86.ssse3.phsub.w.128(
return _mm_hsub_pi16(a, b);		return _mm_hsub_pi16(a, b);
}		}

__m64 test_mm_hsub_pi32(__m64 a, __m64 b) {		__m64 test_mm_hsub_pi32(__m64 a, __m64 b) {
// CHECK-LABEL: test_mm_hsub_pi32		// CHECK-LABEL: test_mm_hsub_pi32
// CHECK: call x86_mmx @llvm.x86.ssse3.phsub.d		// CHECK: call <4 x i32> @llvm.x86.ssse3.phsub.d.128(
return _mm_hsub_pi32(a, b);		return _mm_hsub_pi32(a, b);
}		}

__m64 test_mm_hsubs_pi16(__m64 a, __m64 b) {		__m64 test_mm_hsubs_pi16(__m64 a, __m64 b) {
// CHECK-LABEL: test_mm_hsubs_pi16		// CHECK-LABEL: test_mm_hsubs_pi16
// CHECK: call x86_mmx @llvm.x86.ssse3.phsub.sw		// CHECK: call <8 x i16> @llvm.x86.ssse3.phsub.sw.128(
return _mm_hsubs_pi16(a, b);		return _mm_hsubs_pi16(a, b);
}		}

__m64 test_mm_insert_pi16(__m64 a, int d) {		__m64 test_mm_insert_pi16(__m64 a, int d) {
// CHECK-LABEL: test_mm_insert_pi16		// CHECK-LABEL: test_mm_insert_pi16
// CHECK: call x86_mmx @llvm.x86.mmx.pinsr.w		// CHECK: insertelement <4 x i16>
return _mm_insert_pi16(a, d, 2);		return _mm_insert_pi16(a, d, 2);
}		}

__m64 test_mm_madd_pi16(__m64 a, __m64 b) {		__m64 test_mm_madd_pi16(__m64 a, __m64 b) {
// CHECK-LABEL: test_mm_madd_pi16		// CHECK-LABEL: test_mm_madd_pi16
// CHECK: call x86_mmx @llvm.x86.mmx.pmadd.wd		// CHECK: call <4 x i32> @llvm.x86.sse2.pmadd.wd(
return _mm_madd_pi16(a, b);		return _mm_madd_pi16(a, b);
}		}

__m64 test_mm_maddubs_pi16(__m64 a, __m64 b) {		__m64 test_mm_maddubs_pi16(__m64 a, __m64 b) {
// CHECK-LABEL: test_mm_maddubs_pi16		// CHECK-LABEL: test_mm_maddubs_pi16
// CHECK: call x86_mmx @llvm.x86.ssse3.pmadd.ub.sw		// CHECK: call <8 x i16> @llvm.x86.ssse3.pmadd.ub.sw.128(
return _mm_maddubs_pi16(a, b);		return _mm_maddubs_pi16(a, b);
}		}

void test_mm_maskmove_si64(__m64 d, __m64 n, char *p) {		void test_mm_maskmove_si64(__m64 d, __m64 n, char *p) {
// CHECK-LABEL: test_mm_maskmove_si64		// CHECK-LABEL: test_mm_maskmove_si64
// CHECK: call void @llvm.x86.mmx.maskmovq		// CHECK: call void @llvm.x86.sse2.maskmov.dqu(
_mm_maskmove_si64(d, n, p);		_mm_maskmove_si64(d, n, p);
}		}

__m64 test_mm_max_pi16(__m64 a, __m64 b) {		__m64 test_mm_max_pi16(__m64 a, __m64 b) {
// CHECK-LABEL: test_mm_max_pi16		// CHECK-LABEL: test_mm_max_pi16
// CHECK: call x86_mmx @llvm.x86.mmx.pmaxs.w		// CHECK: call <8 x i16> @llvm.smax.v8i16(
return _mm_max_pi16(a, b);		return _mm_max_pi16(a, b);
}		}

__m64 test_mm_max_pu8(__m64 a, __m64 b) {		__m64 test_mm_max_pu8(__m64 a, __m64 b) {
// CHECK-LABEL: test_mm_max_pu8		// CHECK-LABEL: test_mm_max_pu8
// CHECK: call x86_mmx @llvm.x86.mmx.pmaxu.b		// CHECK: call <16 x i8> @llvm.umax.v16i8(
return _mm_max_pu8(a, b);		return _mm_max_pu8(a, b);
}		}

__m64 test_mm_min_pi16(__m64 a, __m64 b) {		__m64 test_mm_min_pi16(__m64 a, __m64 b) {
// CHECK-LABEL: test_mm_min_pi16		// CHECK-LABEL: test_mm_min_pi16
// CHECK: call x86_mmx @llvm.x86.mmx.pmins.w		// CHECK: call <8 x i16> @llvm.smin.v8i16(
return _mm_min_pi16(a, b);		return _mm_min_pi16(a, b);
}		}

__m64 test_mm_min_pu8(__m64 a, __m64 b) {		__m64 test_mm_min_pu8(__m64 a, __m64 b) {
// CHECK-LABEL: test_mm_min_pu8		// CHECK-LABEL: test_mm_min_pu8
// CHECK: call x86_mmx @llvm.x86.mmx.pminu.b		// CHECK: call <16 x i8> @llvm.umin.v16i8(
return _mm_min_pu8(a, b);		return _mm_min_pu8(a, b);
}		}

int test_mm_movemask_pi8(__m64 a) {		int test_mm_movemask_pi8(__m64 a) {
// CHECK-LABEL: test_mm_movemask_pi8		// CHECK-LABEL: test_mm_movemask_pi8
// CHECK: call i32 @llvm.x86.mmx.pmovmskb		// CHECK: call i32 @llvm.x86.sse2.pmovmskb.128(
return _mm_movemask_pi8(a);		return _mm_movemask_pi8(a);
}		}

__m64 test_mm_mul_su32(__m64 a, __m64 b) {		__m64 test_mm_mul_su32(__m64 a, __m64 b) {
// CHECK-LABEL: test_mm_mul_su32		// CHECK-LABEL: test_mm_mul_su32
// CHECK: call x86_mmx @llvm.x86.mmx.pmulu.dq(x86_mmx %{{.*}}, x86_mmx %{{.*}})		// CHECK: and <2 x i64> {{%.*}}, <i64 4294967295, i64 4294967295>
		// CHECK: and <2 x i64> {{%.*}}, <i64 4294967295, i64 4294967295>
		// CHECK: mul <2 x i64> %{{.*}}, %{{.*}}
return _mm_mul_su32(a, b);		return _mm_mul_su32(a, b);
}		}
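The new checks for test_mm_mul_su32 expect two `and` masks with 4294967295 (0xFFFFFFFF) followed by a plain 64-bit `mul`. As a rough illustration only, the arithmetic for a single 64-bit lane can be modelled as below; the function name `mul_su32_lane` and the constants are hypothetical and are not part of the patch.

```python
# Hypothetical reference model, not code from the patch.
MASK32 = 0xFFFFFFFF          # 4294967295, the constant in the CHECK lines
MASK64 = (1 << 64) - 1       # keep results within one 64-bit lane


def mul_su32_lane(a: int, b: int) -> int:
    """Multiply the low 32 bits of two 64-bit lanes as unsigned values."""
    lo_a = a & MASK32        # first `and`
    lo_b = b & MASK32        # second `and`
    return (lo_a * lo_b) & MASK64   # 64-bit `mul`
```

Because both operands are below 2^32 after masking, the product always fits in 64 bits, so the final mask in the model never discards anything.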

__m64 test_mm_mulhi_pi16(__m64 a, __m64 b) {		__m64 test_mm_mulhi_pi16(__m64 a, __m64 b) {
// CHECK-LABEL: test_mm_mulhi_pi16		// CHECK-LABEL: test_mm_mulhi_pi16
// CHECK: call x86_mmx @llvm.x86.mmx.pmulh.w		// CHECK: call <8 x i16> @llvm.x86.sse2.pmulh.w(
return _mm_mulhi_pi16(a, b);		return _mm_mulhi_pi16(a, b);
}		}

__m64 test_mm_mulhi_pu16(__m64 a, __m64 b) {		__m64 test_mm_mulhi_pu16(__m64 a, __m64 b) {
// CHECK-LABEL: test_mm_mulhi_pu16		// CHECK-LABEL: test_mm_mulhi_pu16
// CHECK: call x86_mmx @llvm.x86.mmx.pmulhu.w		// CHECK: call <8 x i16> @llvm.x86.sse2.pmulhu.w(
return _mm_mulhi_pu16(a, b);		return _mm_mulhi_pu16(a, b);
}		}

__m64 test_mm_mulhrs_pi16(__m64 a, __m64 b) {		__m64 test_mm_mulhrs_pi16(__m64 a, __m64 b) {
// CHECK-LABEL: test_mm_mulhrs_pi16		// CHECK-LABEL: test_mm_mulhrs_pi16
// CHECK: call x86_mmx @llvm.x86.ssse3.pmul.hr.sw		// CHECK: call <8 x i16> @llvm.x86.ssse3.pmul.hr.sw.128(
return _mm_mulhrs_pi16(a, b);		return _mm_mulhrs_pi16(a, b);
}		}

__m64 test_mm_mullo_pi16(__m64 a, __m64 b) {		__m64 test_mm_mullo_pi16(__m64 a, __m64 b) {
// CHECK-LABEL: test_mm_mullo_pi16		// CHECK-LABEL: test_mm_mullo_pi16
// CHECK: call x86_mmx @llvm.x86.mmx.pmull.w		// CHECK: mul <4 x i16> {{%.*}}, {{%.*}}
return _mm_mullo_pi16(a, b);		return _mm_mullo_pi16(a, b);
}		}

__m64 test_mm_or_si64(__m64 a, __m64 b) {		__m64 test_mm_or_si64(__m64 a, __m64 b) {
// CHECK-LABEL: test_mm_or_si64		// CHECK-LABEL: test_mm_or_si64
// CHECK: call x86_mmx @llvm.x86.mmx.por		// CHECK: or <1 x i64> {{%.*}}, {{%.*}}
return _mm_or_si64(a, b);		return _mm_or_si64(a, b);
}		}

__m64 test_mm_packs_pi16(__m64 a, __m64 b) {		__m64 test_mm_packs_pi16(__m64 a, __m64 b) {
// CHECK-LABEL: test_mm_packs_pi16		// CHECK-LABEL: test_mm_packs_pi16
// CHECK: call x86_mmx @llvm.x86.mmx.packsswb		// CHECK: call <16 x i8> @llvm.x86.sse2.packsswb.128(
return _mm_packs_pi16(a, b);		return _mm_packs_pi16(a, b);
}		}

__m64 test_mm_packs_pi32(__m64 a, __m64 b) {		__m64 test_mm_packs_pi32(__m64 a, __m64 b) {
// CHECK-LABEL: test_mm_packs_pi32		// CHECK-LABEL: test_mm_packs_pi32
// CHECK: call x86_mmx @llvm.x86.mmx.packssdw		// CHECK: call <8 x i16> @llvm.x86.sse2.packssdw.128(
return _mm_packs_pi32(a, b);		return _mm_packs_pi32(a, b);
}		}

__m64 test_mm_packs_pu16(__m64 a, __m64 b) {		__m64 test_mm_packs_pu16(__m64 a, __m64 b) {
// CHECK-LABEL: test_mm_packs_pu16		// CHECK-LABEL: test_mm_packs_pu16
// CHECK: call x86_mmx @llvm.x86.mmx.packuswb		// CHECK: call <16 x i8> @llvm.x86.sse2.packuswb.128(
return _mm_packs_pu16(a, b);		return _mm_packs_pu16(a, b);
}		}

__m64 test_mm_sad_pu8(__m64 a, __m64 b) {		__m64 test_mm_sad_pu8(__m64 a, __m64 b) {
// CHECK-LABEL: test_mm_sad_pu8		// CHECK-LABEL: test_mm_sad_pu8
// CHECK: call x86_mmx @llvm.x86.mmx.psad.bw		// CHECK: call <2 x i64> @llvm.x86.sse2.psad.bw(<16 x i8>
return _mm_sad_pu8(a, b);		return _mm_sad_pu8(a, b);
}		}

__m64 test_mm_set_pi8(char a, char b, char c, char d, char e, char f, char g, char h) {		__m64 test_mm_set_pi8(char a, char b, char c, char d, char e, char f, char g, char h) {
// CHECK-LABEL: test_mm_set_pi8		// CHECK-LABEL: test_mm_set_pi8
// CHECK: insertelement <8 x i8>		// CHECK: insertelement <8 x i8>
// CHECK: insertelement <8 x i8>		// CHECK: insertelement <8 x i8>
// CHECK: insertelement <8 x i8>		// CHECK: insertelement <8 x i8>
(76 lines not shown)
__m64 test_mm_set1_pi32(int a) {		__m64 test_mm_set1_pi32(int a) {
// CHECK-LABEL: test_mm_set1_pi32		// CHECK-LABEL: test_mm_set1_pi32
// CHECK: insertelement <2 x i32>		// CHECK: insertelement <2 x i32>
// CHECK: insertelement <2 x i32>		// CHECK: insertelement <2 x i32>
return _mm_set1_pi32(a);		return _mm_set1_pi32(a);
}		}

__m64 test_mm_shuffle_pi8(__m64 a, __m64 b) {		__m64 test_mm_shuffle_pi8(__m64 a, __m64 b) {
// CHECK-LABEL: test_mm_shuffle_pi8		// CHECK-LABEL: test_mm_shuffle_pi8
// CHECK: call x86_mmx @llvm.x86.ssse3.pshuf.b		// CHECK: call <16 x i8> @llvm.x86.ssse3.pshuf.b.128(
return _mm_shuffle_pi8(a, b);		return _mm_shuffle_pi8(a, b);
}		}

__m64 test_mm_shuffle_pi16(__m64 a) {		__m64 test_mm_shuffle_pi16(__m64 a) {
// CHECK-LABEL: test_mm_shuffle_pi16		// CHECK-LABEL: test_mm_shuffle_pi16
// CHECK: call x86_mmx @llvm.x86.sse.pshuf.w		// CHECK: shufflevector <4 x i16> {{%.*}}, <4 x i16> {{%.*}}, <4 x i32> <i32 3, i32 0, i32 0, i32 0>
return _mm_shuffle_pi16(a, 3);		return _mm_shuffle_pi16(a, 3);
}		}

__m64 test_mm_sign_pi8(__m64 a, __m64 b) {		__m64 test_mm_sign_pi8(__m64 a, __m64 b) {
// CHECK-LABEL: test_mm_sign_pi8		// CHECK-LABEL: test_mm_sign_pi8
// CHECK: call x86_mmx @llvm.x86.ssse3.psign.b		// CHECK: call <16 x i8> @llvm.x86.ssse3.psign.b.128(
return _mm_sign_pi8(a, b);		return _mm_sign_pi8(a, b);
}		}

__m64 test_mm_sign_pi16(__m64 a, __m64 b) {		__m64 test_mm_sign_pi16(__m64 a, __m64 b) {
// CHECK-LABEL: test_mm_sign_pi16		// CHECK-LABEL: test_mm_sign_pi16
// CHECK: call x86_mmx @llvm.x86.ssse3.psign.w		// CHECK: call <8 x i16> @llvm.x86.ssse3.psign.w.128(
return _mm_sign_pi16(a, b);		return _mm_sign_pi16(a, b);
}		}

__m64 test_mm_sign_pi32(__m64 a, __m64 b) {		__m64 test_mm_sign_pi32(__m64 a, __m64 b) {
// CHECK-LABEL: test_mm_sign_pi32		// CHECK-LABEL: test_mm_sign_pi32
// CHECK: call x86_mmx @llvm.x86.ssse3.psign.d		// CHECK: call <4 x i32> @llvm.x86.ssse3.psign.d.128(
return _mm_sign_pi32(a, b);		return _mm_sign_pi32(a, b);
}		}

__m64 test_mm_sll_pi16(__m64 a, __m64 b) {		__m64 test_mm_sll_pi16(__m64 a, __m64 b) {
// CHECK-LABEL: test_mm_sll_pi16		// CHECK-LABEL: test_mm_sll_pi16
// CHECK: call x86_mmx @llvm.x86.mmx.psll.w		// CHECK: call <8 x i16> @llvm.x86.sse2.psll.w(
return _mm_sll_pi16(a, b);		return _mm_sll_pi16(a, b);
}		}

__m64 test_mm_sll_pi32(__m64 a, __m64 b) {		__m64 test_mm_sll_pi32(__m64 a, __m64 b) {
// CHECK-LABEL: test_mm_sll_pi32		// CHECK-LABEL: test_mm_sll_pi32
// CHECK: call x86_mmx @llvm.x86.mmx.psll.d		// CHECK: call <4 x i32> @llvm.x86.sse2.psll.d(
return _mm_sll_pi32(a, b);		return _mm_sll_pi32(a, b);
}		}

__m64 test_mm_sll_si64(__m64 a, __m64 b) {		__m64 test_mm_sll_si64(__m64 a, __m64 b) {
// CHECK-LABEL: test_mm_sll_si64		// CHECK-LABEL: test_mm_sll_si64
// CHECK: call x86_mmx @llvm.x86.mmx.psll.q		// CHECK: call <2 x i64> @llvm.x86.sse2.psll.q(
return _mm_sll_si64(a, b);		return _mm_sll_si64(a, b);
}		}

__m64 test_mm_slli_pi16(__m64 a) {		__m64 test_mm_slli_pi16(__m64 a) {
// CHECK-LABEL: test_mm_slli_pi16		// CHECK-LABEL: test_mm_slli_pi16
// CHECK: call x86_mmx @llvm.x86.mmx.pslli.w		// CHECK: call <8 x i16> @llvm.x86.sse2.pslli.w(
return _mm_slli_pi16(a, 3);		return _mm_slli_pi16(a, 3);
}		}

__m64 test_mm_slli_pi32(__m64 a) {		__m64 test_mm_slli_pi32(__m64 a) {
// CHECK-LABEL: test_mm_slli_pi32		// CHECK-LABEL: test_mm_slli_pi32
// CHECK: call x86_mmx @llvm.x86.mmx.pslli.d		// CHECK: call <4 x i32> @llvm.x86.sse2.pslli.d(
return _mm_slli_pi32(a, 3);		return _mm_slli_pi32(a, 3);
}		}

__m64 test_mm_slli_si64(__m64 a) {		__m64 test_mm_slli_si64(__m64 a) {
// CHECK-LABEL: test_mm_slli_si64		// CHECK-LABEL: test_mm_slli_si64
// CHECK: call x86_mmx @llvm.x86.mmx.pslli.q		// CHECK: call <2 x i64> @llvm.x86.sse2.pslli.q(
return _mm_slli_si64(a, 3);		return _mm_slli_si64(a, 3);
}		}

__m64 test_mm_sra_pi16(__m64 a, __m64 b) {		__m64 test_mm_sra_pi16(__m64 a, __m64 b) {
// CHECK-LABEL: test_mm_sra_pi16		// CHECK-LABEL: test_mm_sra_pi16
// CHECK: call x86_mmx @llvm.x86.mmx.psra.w		// CHECK: call <8 x i16> @llvm.x86.sse2.psra.w(
return _mm_sra_pi16(a, b);		return _mm_sra_pi16(a, b);
}		}

__m64 test_mm_sra_pi32(__m64 a, __m64 b) {		__m64 test_mm_sra_pi32(__m64 a, __m64 b) {
// CHECK-LABEL: test_mm_sra_pi32		// CHECK-LABEL: test_mm_sra_pi32
// CHECK: call x86_mmx @llvm.x86.mmx.psra.d		// CHECK: call <4 x i32> @llvm.x86.sse2.psra.d(
return _mm_sra_pi32(a, b);		return _mm_sra_pi32(a, b);
}		}

__m64 test_mm_srai_pi16(__m64 a) {		__m64 test_mm_srai_pi16(__m64 a) {
// CHECK-LABEL: test_mm_srai_pi16		// CHECK-LABEL: test_mm_srai_pi16
// CHECK: call x86_mmx @llvm.x86.mmx.psrai.w		// CHECK: call <8 x i16> @llvm.x86.sse2.psrai.w(
return _mm_srai_pi16(a, 3);		return _mm_srai_pi16(a, 3);
}		}

__m64 test_mm_srai_pi32(__m64 a) {		__m64 test_mm_srai_pi32(__m64 a) {
// CHECK-LABEL: test_mm_srai_pi32		// CHECK-LABEL: test_mm_srai_pi32
// CHECK: call x86_mmx @llvm.x86.mmx.psrai.d		// CHECK: call <4 x i32> @llvm.x86.sse2.psrai.d(
return _mm_srai_pi32(a, 3);		return _mm_srai_pi32(a, 3);
}		}

__m64 test_mm_srl_pi16(__m64 a, __m64 b) {		__m64 test_mm_srl_pi16(__m64 a, __m64 b) {
// CHECK-LABEL: test_mm_srl_pi16		// CHECK-LABEL: test_mm_srl_pi16
// CHECK: call x86_mmx @llvm.x86.mmx.psrl.w		// CHECK: call <8 x i16> @llvm.x86.sse2.psrl.w(
return _mm_srl_pi16(a, b);		return _mm_srl_pi16(a, b);
}		}

__m64 test_mm_srl_pi32(__m64 a, __m64 b) {		__m64 test_mm_srl_pi32(__m64 a, __m64 b) {
// CHECK-LABEL: test_mm_srl_pi32		// CHECK-LABEL: test_mm_srl_pi32
// CHECK: call x86_mmx @llvm.x86.mmx.psrl.d		// CHECK: call <4 x i32> @llvm.x86.sse2.psrl.d(
return _mm_srl_pi32(a, b);		return _mm_srl_pi32(a, b);
}		}

__m64 test_mm_srl_si64(__m64 a, __m64 b) {		__m64 test_mm_srl_si64(__m64 a, __m64 b) {
// CHECK-LABEL: test_mm_srl_si64		// CHECK-LABEL: test_mm_srl_si64
// CHECK: call x86_mmx @llvm.x86.mmx.psrl.q		// CHECK: call <2 x i64> @llvm.x86.sse2.psrl.q(
return _mm_srl_si64(a, b);		return _mm_srl_si64(a, b);
}		}

__m64 test_mm_srli_pi16(__m64 a) {		__m64 test_mm_srli_pi16(__m64 a) {
// CHECK-LABEL: test_mm_srli_pi16		// CHECK-LABEL: test_mm_srli_pi16
// CHECK: call x86_mmx @llvm.x86.mmx.psrli.w		// CHECK: call <8 x i16> @llvm.x86.sse2.psrli.w(
return _mm_srli_pi16(a, 3);		return _mm_srli_pi16(a, 3);
}		}

__m64 test_mm_srli_pi32(__m64 a) {		__m64 test_mm_srli_pi32(__m64 a) {
// CHECK-LABEL: test_mm_srli_pi32		// CHECK-LABEL: test_mm_srli_pi32
// CHECK: call x86_mmx @llvm.x86.mmx.psrli.d		// CHECK: call <4 x i32> @llvm.x86.sse2.psrli.d(
return _mm_srli_pi32(a, 3);		return _mm_srli_pi32(a, 3);
}		}

__m64 test_mm_srli_si64(__m64 a) {		__m64 test_mm_srli_si64(__m64 a) {
// CHECK-LABEL: test_mm_srli_si64		// CHECK-LABEL: test_mm_srli_si64
// CHECK: call x86_mmx @llvm.x86.mmx.psrli.q		// CHECK: call <2 x i64> @llvm.x86.sse2.psrli.q(
return _mm_srli_si64(a, 3);		return _mm_srli_si64(a, 3);
}		}

void test_mm_stream_pi(__m64 *p, __m64 a) {		void test_mm_stream_pi(__m64 *p, __m64 a) {
// CHECK-LABEL: test_mm_stream_pi		// CHECK-LABEL: test_mm_stream_pi
// CHECK: call void @llvm.x86.mmx.movnt.dq		// CHECK: store <1 x i64> {{%.*}}, <1 x i64>* {{%.*}}, align 8, !nontemporal
_mm_stream_pi(p, a);		_mm_stream_pi(p, a);
}		}

__m64 test_mm_sub_pi8(__m64 a, __m64 b) {		__m64 test_mm_sub_pi8(__m64 a, __m64 b) {
// CHECK-LABEL: test_mm_sub_pi8		// CHECK-LABEL: test_mm_sub_pi8
// CHECK: call x86_mmx @llvm.x86.mmx.psub.b		// CHECK: sub <8 x i8> {{%.*}}, {{%.*}}
return _mm_sub_pi8(a, b);		return _mm_sub_pi8(a, b);
}		}

__m64 test_mm_sub_pi16(__m64 a, __m64 b) {		__m64 test_mm_sub_pi16(__m64 a, __m64 b) {
// CHECK-LABEL: test_mm_sub_pi16		// CHECK-LABEL: test_mm_sub_pi16
// CHECK: call x86_mmx @llvm.x86.mmx.psub.w		// CHECK: sub <4 x i16> {{%.*}}, {{%.*}}
return _mm_sub_pi16(a, b);		return _mm_sub_pi16(a, b);
}		}

__m64 test_mm_sub_pi32(__m64 a, __m64 b) {		__m64 test_mm_sub_pi32(__m64 a, __m64 b) {
// CHECK-LABEL: test_mm_sub_pi32		// CHECK-LABEL: test_mm_sub_pi32
// CHECK: call x86_mmx @llvm.x86.mmx.psub.d		// CHECK: sub <2 x i32> {{%.*}}, {{%.*}}
return _mm_sub_pi32(a, b);		return _mm_sub_pi32(a, b);
}		}

__m64 test_mm_sub_si64(__m64 a, __m64 b) {		__m64 test_mm_sub_si64(__m64 a, __m64 b) {
// CHECK-LABEL: test_mm_sub_si64		// CHECK-LABEL: test_mm_sub_si64
// CHECK: call x86_mmx @llvm.x86.mmx.psub.q(x86_mmx %{{.*}}, x86_mmx %{{.*}})		// CHECK: sub i64 {{%.*}}, {{%.*}}
return _mm_sub_si64(a, b);		return _mm_sub_si64(a, b);
}		}

__m64 test_mm_subs_pi8(__m64 a, __m64 b) {		__m64 test_mm_subs_pi8(__m64 a, __m64 b) {
// CHECK-LABEL: test_mm_subs_pi8		// CHECK-LABEL: test_mm_subs_pi8
// CHECK: call x86_mmx @llvm.x86.mmx.psubs.b		// CHECK: call <16 x i8> @llvm.ssub.sat.v16i8(
return _mm_subs_pi8(a, b);		return _mm_subs_pi8(a, b);
}		}

__m64 test_mm_subs_pi16(__m64 a, __m64 b) {		__m64 test_mm_subs_pi16(__m64 a, __m64 b) {
// CHECK-LABEL: test_mm_subs_pi16		// CHECK-LABEL: test_mm_subs_pi16
// CHECK: call x86_mmx @llvm.x86.mmx.psubs.w		// CHECK: call <8 x i16> @llvm.ssub.sat.v8i16(
return _mm_subs_pi16(a, b);		return _mm_subs_pi16(a, b);
}		}

__m64 test_mm_subs_pu8(__m64 a, __m64 b) {		__m64 test_mm_subs_pu8(__m64 a, __m64 b) {
// CHECK-LABEL: test_mm_subs_pu8		// CHECK-LABEL: test_mm_subs_pu8
// CHECK: call x86_mmx @llvm.x86.mmx.psubus.b		// CHECK: call <16 x i8> @llvm.usub.sat.v16i8(
return _mm_subs_pu8(a, b);		return _mm_subs_pu8(a, b);
}		}

__m64 test_mm_subs_pu16(__m64 a, __m64 b) {		__m64 test_mm_subs_pu16(__m64 a, __m64 b) {
// CHECK-LABEL: test_mm_subs_pu16		// CHECK-LABEL: test_mm_subs_pu16
// CHECK: call x86_mmx @llvm.x86.mmx.psubus.w		// CHECK: call <8 x i16> @llvm.usub.sat.v8i16(
return _mm_subs_pu16(a, b);		return _mm_subs_pu16(a, b);
}		}

int test_m_to_int(__m64 a) {		int test_m_to_int(__m64 a) {
// CHECK-LABEL: test_m_to_int		// CHECK-LABEL: test_m_to_int
// CHECK: extractelement <2 x i32>		// CHECK: extractelement <2 x i32>
return _m_to_int(a);		return _m_to_int(a);
}		}

long long test_m_to_int64(__m64 a) {		long long test_m_to_int64(__m64 a) {
// CHECK-LABEL: test_m_to_int64		// CHECK-LABEL: test_m_to_int64
// CHECK: bitcast		// CHECK: bitcast
return _m_to_int64(a);		return _m_to_int64(a);
}		}

__m64 test_mm_unpackhi_pi8(__m64 a, __m64 b) {		__m64 test_mm_unpackhi_pi8(__m64 a, __m64 b) {
// CHECK-LABEL: test_mm_unpackhi_pi8		// CHECK-LABEL: test_mm_unpackhi_pi8
// CHECK: call x86_mmx @llvm.x86.mmx.punpckhbw		// CHECK: shufflevector <8 x i8> {{%.*}}, <8 x i8> {{%.*}}, <8 x i32> <i32 4, i32 12, i32 5, i32 13, i32 6, i32 14, i32 7, i32 15>
return _mm_unpackhi_pi8(a, b);		return _mm_unpackhi_pi8(a, b);
}		}

__m64 test_mm_unpackhi_pi16(__m64 a, __m64 b) {		__m64 test_mm_unpackhi_pi16(__m64 a, __m64 b) {
// CHECK-LABEL: test_mm_unpackhi_pi16		// CHECK-LABEL: test_mm_unpackhi_pi16
// CHECK: call x86_mmx @llvm.x86.mmx.punpckhwd		// CHECK: shufflevector <4 x i16> {{%.*}}, <4 x i16> {{%.*}}, <4 x i32> <i32 2, i32 6, i32 3, i32 7>
return _mm_unpackhi_pi16(a, b);		return _mm_unpackhi_pi16(a, b);
}		}

__m64 test_mm_unpackhi_pi32(__m64 a, __m64 b) {		__m64 test_mm_unpackhi_pi32(__m64 a, __m64 b) {
// CHECK-LABEL: test_mm_unpackhi_pi32		// CHECK-LABEL: test_mm_unpackhi_pi32
// CHECK: call x86_mmx @llvm.x86.mmx.punpckhdq		// CHECK: shufflevector <2 x i32> {{%.*}}, <2 x i32> {{%.*}}, <2 x i32> <i32 1, i32 3>
return _mm_unpackhi_pi32(a, b);		return _mm_unpackhi_pi32(a, b);
}		}

__m64 test_mm_unpacklo_pi8(__m64 a, __m64 b) {		__m64 test_mm_unpacklo_pi8(__m64 a, __m64 b) {
// CHECK-LABEL: test_mm_unpacklo_pi8		// CHECK-LABEL: test_mm_unpacklo_pi8
// CHECK: call x86_mmx @llvm.x86.mmx.punpcklbw		// CHECK: shufflevector <8 x i8> {{%.*}}, <8 x i8> {{%.*}}, <8 x i32> <i32 0, i32 8, i32 1, i32 9, i32 2, i32 10, i32 3, i32 11>
return _mm_unpacklo_pi8(a, b);		return _mm_unpacklo_pi8(a, b);
}		}

__m64 test_mm_unpacklo_pi16(__m64 a, __m64 b) {		__m64 test_mm_unpacklo_pi16(__m64 a, __m64 b) {
// CHECK-LABEL: test_mm_unpacklo_pi16		// CHECK-LABEL: test_mm_unpacklo_pi16
// CHECK: call x86_mmx @llvm.x86.mmx.punpcklwd		// CHECK: shufflevector <4 x i16> {{%.*}}, <4 x i16> {{%.*}}, <4 x i32> <i32 0, i32 4, i32 1, i32 5>
return _mm_unpacklo_pi16(a, b);		return _mm_unpacklo_pi16(a, b);
}		}

__m64 test_mm_unpacklo_pi32(__m64 a, __m64 b) {		__m64 test_mm_unpacklo_pi32(__m64 a, __m64 b) {
// CHECK-LABEL: test_mm_unpacklo_pi32		// CHECK-LABEL: test_mm_unpacklo_pi32
// CHECK: call x86_mmx @llvm.x86.mmx.punpckldq		// CHECK: shufflevector <2 x i32> {{%.*}}, <2 x i32> {{%.*}}, <2 x i32> <i32 0, i32 2>
return _mm_unpacklo_pi32(a, b);		return _mm_unpacklo_pi32(a, b);
}		}
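The unpack tests above now check `shufflevector` masks instead of MMX intrinsic calls. As a sketch of how those masks are read, each index selects from the concatenation of the two input vectors, so for 8-element inputs indices 0 to 7 come from the first operand and 8 to 15 from the second. The helper name `shufflevector` below is only a hypothetical Python stand-in for the IR instruction.

```python
# Hypothetical model of LLVM's shufflevector for illustration only.
def shufflevector(a, b, mask):
    """Pick elements from the concatenation a + b using the given indices."""
    combined = list(a) + list(b)
    return [combined[i] for i in mask]


# Masks copied from the CHECK lines for the 8 x i8 unpack tests.
UNPACKHI_PI8 = [4, 12, 5, 13, 6, 14, 7, 15]
UNPACKLO_PI8 = [0, 8, 1, 9, 2, 10, 3, 11]

a = [f"a{i}" for i in range(8)]
b = [f"b{i}" for i in range(8)]
hi = shufflevector(a, b, UNPACKHI_PI8)  # interleaves the upper halves
lo = shufflevector(a, b, UNPACKLO_PI8)  # interleaves the lower halves
```

Labelling the elements `a0..a7` and `b0..b7` makes the interleaving visible directly in the result lists.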

__m64 test_mm_xor_si64(__m64 a, __m64 b) {		__m64 test_mm_xor_si64(__m64 a, __m64 b) {
// CHECK-LABEL: test_mm_xor_si64		// CHECK-LABEL: test_mm_xor_si64
// CHECK: call x86_mmx @llvm.x86.mmx.pxor		// CHECK: xor <1 x i64> {{%.*}}, {{%.*}}
return _mm_xor_si64(a, b);		return _mm_xor_si64(a, b);
}		}

clang/test/CodeGen/X86/mmx-shift-with-immediate.c

	// RUN: %clang -mmmx -target i386-unknown-unknown -emit-llvm -S %s -o - | FileCheck %s			// RUN: %clang -mmmx -target i386-unknown-unknown -emit-llvm -S %s -o - | FileCheck %s
	#include <mmintrin.h>			#include <mmintrin.h>

	void shift(__m64 a, __m64 b, int c) {			void shift(__m64 a, __m64 b, int c) {
	// CHECK: x86_mmx @llvm.x86.mmx.pslli.w(x86_mmx %{{.*}}, i32 {{.*}})			// CHECK: <8 x i16> @llvm.x86.sse2.pslli.w(<8 x i16> %{{.*}}, i32 {{.*}})
	_mm_slli_pi16(a, c);			_mm_slli_pi16(a, c);
	// CHECK: x86_mmx @llvm.x86.mmx.pslli.d(x86_mmx %{{.*}}, i32 {{.*}})			// CHECK: <4 x i32> @llvm.x86.sse2.pslli.d(<4 x i32> %{{.*}}, i32 {{.*}})
	_mm_slli_pi32(a, c);			_mm_slli_pi32(a, c);
	// CHECK: x86_mmx @llvm.x86.mmx.pslli.q(x86_mmx %{{.*}}, i32 {{.*}})			// CHECK: <2 x i64> @llvm.x86.sse2.pslli.q(<2 x i64> %{{.*}}, i32 {{.*}})
	_mm_slli_si64(a, c);			_mm_slli_si64(a, c);

	// CHECK: x86_mmx @llvm.x86.mmx.psrli.w(x86_mmx %{{.*}}, i32 {{.*}})			// CHECK: <8 x i16> @llvm.x86.sse2.psrli.w(<8 x i16> %{{.*}}, i32 {{.*}})
	_mm_srli_pi16(a, c);			_mm_srli_pi16(a, c);
	// CHECK: x86_mmx @llvm.x86.mmx.psrli.d(x86_mmx %{{.*}}, i32 {{.*}})			// CHECK: <4 x i32> @llvm.x86.sse2.psrli.d(<4 x i32> %{{.*}}, i32 {{.*}})
	_mm_srli_pi32(a, c);			_mm_srli_pi32(a, c);
	// CHECK: x86_mmx @llvm.x86.mmx.psrli.q(x86_mmx %{{.*}}, i32 {{.*}})			// CHECK: <2 x i64> @llvm.x86.sse2.psrli.q(<2 x i64> %{{.*}}, i32 {{.*}})
	_mm_srli_si64(a, c);			_mm_srli_si64(a, c);

	// CHECK: x86_mmx @llvm.x86.mmx.psrai.w(x86_mmx %{{.*}}, i32 {{.*}})			// CHECK: <8 x i16> @llvm.x86.sse2.psrai.w(<8 x i16> %{{.*}}, i32 {{.*}})
	_mm_srai_pi16(a, c);			_mm_srai_pi16(a, c);
	// CHECK: x86_mmx @llvm.x86.mmx.psrai.d(x86_mmx %{{.*}}, i32 {{.*}})			// CHECK: <4 x i32> @llvm.x86.sse2.psrai.d(<4 x i32> %{{.*}}, i32 {{.*}})
	_mm_srai_pi32(a, c);			_mm_srai_pi32(a, c);
	}			}
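The shift test above passes a run-time count `c` to the immediate-shift intrinsics. For readers unfamiliar with the semantics being preserved, here is a small sketch of a logical left shift on 16-bit lanes as PSLLW defines it: a count above 15 clears every lane instead of wrapping. The function name `pslli_w` is a hypothetical model, and it assumes a non-negative count.

```python
# Hypothetical reference model of a 16-bit-lane logical left shift.
def pslli_w(lanes, count):
    """Shift each 16-bit lane left by `count`; counts above 15 yield zero.

    Assumes `count` is non-negative and each lane is in range(0, 0x10000).
    """
    if count > 15:
        return [0] * len(lanes)
    return [(x << count) & 0xFFFF for x in lanes]
```

The same model applies whether the four lanes live in a 64-bit `__m64` or in the low half of a 128-bit register, since each lane is shifted independently.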

clang/test/CodeGen/attr-target-x86-mmx.c

	// RUN: %clang_cc1 -triple i386-linux-gnu -emit-llvm %s -o - | FileCheck %s			// RUN: %clang_cc1 -triple i386-linux-gnu -emit-llvm %s -o - | FileCheck %s
	// Picking a cpu that doesn't have mmx or sse by default so we can enable it later.			// Picking a cpu that doesn't have sse by default so we can enable it later.

	#define __MM_MALLOC_H			#define __MM_MALLOC_H

	#include <x86intrin.h>			#include <x86intrin.h>

	// Verify that when we turn on sse that we also turn on mmx.			void __attribute__((target("sse2"))) shift(__m64 a, __m64 b, int c) {
	void __attribute__((target("sse"))) shift(__m64 a, __m64 b, int c) {
	_mm_slli_pi16(a, c);			_mm_slli_pi16(a, c);
	_mm_slli_pi32(a, c);			_mm_slli_pi32(a, c);
	_mm_slli_si64(a, c);			_mm_slli_si64(a, c);

	_mm_srli_pi16(a, c);			_mm_srli_pi16(a, c);
	_mm_srli_pi32(a, c);			_mm_srli_pi32(a, c);
	_mm_srli_si64(a, c);			_mm_srli_si64(a, c);

	_mm_srai_pi16(a, c);			_mm_srai_pi16(a, c);
	_mm_srai_pi32(a, c);			_mm_srai_pi32(a, c);
	}			}

	// CHECK: "target-features"="+cx8,+mmx,+sse,+x87"			// CHECK: "target-features"="+cx8,+mmx,+sse,+sse2,+x87"

clang/test/Headers/xmmintrin.c

	// RUN: %clang_cc1 %s -ffreestanding -triple x86_64-apple-macosx10.9.0 -emit-llvm -o - | FileCheck %s			// RUN: %clang_cc1 %s -ffreestanding -triple x86_64-apple-macosx10.9.0 -emit-llvm -o - | FileCheck %s
	//			//
	// RUN: rm -rf %t			// RUN: rm -rf %t
	// RUN: %clang_cc1 %s -ffreestanding -triple x86_64-apple-macosx10.9.0 -emit-llvm -o - \			// RUN: %clang_cc1 %s -ffreestanding -triple x86_64-apple-macosx10.9.0 -emit-llvm -o - \
	// RUN: -fmodules -fimplicit-module-maps -fmodules-cache-path=%t -isystem %S/Inputs/include \			// RUN: -fmodules -fimplicit-module-maps -fmodules-cache-path=%t -isystem %S/Inputs/include \
	// RUN: | FileCheck %s			// RUN: | FileCheck %s
	// REQUIRES: x86-registered-target			// REQUIRES: x86-registered-target
	#include <xmmintrin.h>			#include <xmmintrin.h>

	// CHECK: @c ={{.*}} global i8 0, align 16			// CHECK: @c ={{.*}} global i8 0, align 16
	_MM_ALIGN16 char c;			_MM_ALIGN16 char c;

	// Make sure the last step of _mm_cvtps_pi16 converts <4 x i32> to <4 x i16> by			// Make sure the last step of _mm_cvtps_pi16 converts <4 x i32> to <4 x i16> by
	// checking that clang emits PACKSSDW instead of PACKSSWB.			// checking that clang emits PACKSSDW instead of PACKSSWB.

	// CHECK: define{{.*}} i64 @test_mm_cvtps_pi16			// CHECK: define{{.*}} i64 @test_mm_cvtps_pi16
	// CHECK: call x86_mmx @llvm.x86.mmx.packssdw			// CHECK: call <8 x i16> @llvm.x86.sse2.packssdw.128

	__m64 test_mm_cvtps_pi16(__m128 a) {			__m64 test_mm_cvtps_pi16(__m128 a) {
	return _mm_cvtps_pi16(a);			return _mm_cvtps_pi16(a);
	}			}

	// Make sure that including <xmmintrin.h> also makes <emmintrin.h>'s content available.			// Make sure that including <xmmintrin.h> also makes <emmintrin.h>'s content available.
	// This is an ugly hack for GCC compatibility.			// This is an ugly hack for GCC compatibility.
	__m128d test_xmmintrin_provides_emmintrin(__m128d __a, __m128d __b) {			__m128d test_xmmintrin_provides_emmintrin(__m128d __a, __m128d __b) {
	return _mm_add_sd(__a, __b);			return _mm_add_sd(__a, __b);
	}			}

	#if __STDC_HOSTED__			#if __STDC_HOSTED__
	// Make sure stdlib.h symbols are accessible.			// Make sure stdlib.h symbols are accessible.
	void *p = NULL;			void *p = NULL;
	#endif			#endif

clang/test/Sema/x86-builtin-palignr.c

	// RUN: %clang_cc1 -ffreestanding -fsyntax-only -target-feature +ssse3 -target-feature +mmx -verify -triple x86_64-pc-linux-gnu %s			// RUN: %clang_cc1 -ffreestanding -fsyntax-only -target-feature +ssse3 -target-feature +mmx -verify -triple x86_64-pc-linux-gnu %s
	// RUN: %clang_cc1 -ffreestanding -fsyntax-only -target-feature +ssse3 -target-feature +mmx -verify -triple i686-apple-darwin10 %s			// RUN: %clang_cc1 -ffreestanding -fsyntax-only -target-feature +ssse3 -target-feature +mmx -verify -triple i686-apple-darwin10 %s

	#include <tmmintrin.h>			#include <tmmintrin.h>

	__m64 test1(__m64 a, __m64 b, int c) {			__m64 test1(__m64 a, __m64 b, int c) {
	return _mm_alignr_pi8(a, b, c); // expected-error {{argument to '__builtin_ia32_palignr' must be a constant integer}}			return _mm_alignr_pi8(a, b, c); // expected-error {{argument to '__builtin_ia32_psrldqi128_byteshift' must be a constant integer}}
	}			}

llvm/include/llvm/IR/IntrinsicsX86.td

(2,418 lines not shown)
let TargetPrefix = "x86" in { // All intrinsics start with "llvm.x86.".		let TargetPrefix = "x86" in { // All intrinsics start with "llvm.x86.".

def int_x86_mmx_movnt_dq : GCCBuiltin<"__builtin_ia32_movntq">,		def int_x86_mmx_movnt_dq : GCCBuiltin<"__builtin_ia32_movntq">,
Intrinsic<[], [llvm_ptrx86mmx_ty, llvm_x86mmx_ty], []>;		Intrinsic<[], [llvm_ptrx86mmx_ty, llvm_x86mmx_ty], []>;

def int_x86_mmx_palignr_b : GCCBuiltin<"__builtin_ia32_palignr">,		def int_x86_mmx_palignr_b : GCCBuiltin<"__builtin_ia32_palignr">,
Intrinsic<[llvm_x86mmx_ty], [llvm_x86mmx_ty,		Intrinsic<[llvm_x86mmx_ty], [llvm_x86mmx_ty,
llvm_x86mmx_ty, llvm_i8_ty], [IntrNoMem, ImmArg<ArgIndex<2>>]>;		llvm_x86mmx_ty, llvm_i8_ty], [IntrNoMem, ImmArg<ArgIndex<2>>]>;

def int_x86_mmx_pextr_w : GCCBuiltin<"__builtin_ia32_vec_ext_v4hi">,		def int_x86_mmx_pextr_w :
Intrinsic<[llvm_i32_ty], [llvm_x86mmx_ty, llvm_i32_ty],		Intrinsic<[llvm_i32_ty], [llvm_x86mmx_ty, llvm_i32_ty],
[IntrNoMem, ImmArg<ArgIndex<1>>]>;		[IntrNoMem, ImmArg<ArgIndex<1>>]>;

def int_x86_mmx_pinsr_w : GCCBuiltin<"__builtin_ia32_vec_set_v4hi">,		def int_x86_mmx_pinsr_w :
Intrinsic<[llvm_x86mmx_ty], [llvm_x86mmx_ty,		Intrinsic<[llvm_x86mmx_ty], [llvm_x86mmx_ty,
llvm_i32_ty, llvm_i32_ty], [IntrNoMem, ImmArg<ArgIndex<2>>]>;		llvm_i32_ty, llvm_i32_ty], [IntrNoMem, ImmArg<ArgIndex<2>>]>;
}		}

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
// BMI		// BMI

let TargetPrefix = "x86" in { // All intrinsics start with "llvm.x86.".		let TargetPrefix = "x86" in { // All intrinsics start with "llvm.x86.".
(2,639 lines not shown)

mmx-tests/Makefile

This file was added.

				USE_XMM=
				#USE_XMM=--use-xmm

				OLDCC ?= clang-10
				NEWCC ?= ../build/bin/clang
				TESTCC=$(OLDCC)
				COPTS ?=

				gen_orig.c: mmx-tests.py
					./mmx-tests.py --kind=wrapper --wrapper-prefix=orig $(USE_XMM) > $@
				gen_orig.h: mmx-tests.py
					./mmx-tests.py --kind=wrapper_h --wrapper-prefix=orig $(USE_XMM) > $@
				gen_new.c: mmx-tests.py
					./mmx-tests.py --kind=wrapper --wrapper-prefix=new $(USE_XMM) > $@
				gen_new.h: mmx-tests.py
					./mmx-tests.py --kind=wrapper_h --wrapper-prefix=new $(USE_XMM) > $@
				gen_test.inc: mmx-tests.py
					./mmx-tests.py --kind=test $(USE_XMM) > $@
				gen_orig.o: gen_orig.c
					$(OLDCC) -c $(COPTS) -O2 -o $@ $^
				gen_new.o: gen_new.c
					$(NEWCC) -c $(COPTS) -O2 -o $@ $^
				test.o: test.c gen_test.inc gen_orig.h gen_new.h
					$(TESTCC) -c $(COPTS) -o $@ test.c
				test: test.o gen_orig.o gen_new.o
					$(TESTCC) $(COPTS) -o $@ $^ -lm

				clean:
					rm -f gen_orig.c gen_orig.h gen_new.c gen_new.h gen_test.inc gen_orig.o gen_new.o test.o test

mmx-tests/mmx-tests.py

This file was added.

Property	Old Value	New Value
File Mode	null	100755

				#!/usr/bin/python3

				import argparse
				import sys

				# This is a list of all Intel functions and macros which take or
				# return an __m64.
				def do_mmx(fn):
				    # mmintrin.h
				    fn("_mm_cvtsi32_si64", "__m64", ("int", ))
				    fn("_mm_cvtsi64_si32", "int", ("__m64", ))
				    fn("_mm_cvtsi64_m64", "__m64", ("long long", ), condition='defined(__X86_64__) || defined(__clang__)')
				    fn("_mm_cvtm64_si64", "long long", ("__m64", ), condition='defined(__X86_64__) || defined(__clang__)')
				    fn("_mm_packs_pi16", "__m64", ("__m64", "__m64", ))
				    fn("_mm_packs_pi32", "__m64", ("__m64", "__m64", ))
				    fn("_mm_packs_pu16", "__m64", ("__m64", "__m64", ))
				    fn("_mm_unpackhi_pi8", "__m64", ("__m64", "__m64", ))
				    fn("_mm_unpackhi_pi16", "__m64", ("__m64", "__m64", ))
				    fn("_mm_unpackhi_pi32", "__m64", ("__m64", "__m64", ))
				    fn("_mm_unpacklo_pi8", "__m64", ("__m64", "__m64", ))
				    fn("_mm_unpacklo_pi16", "__m64", ("__m64", "__m64", ))
				    fn("_mm_unpacklo_pi32", "__m64", ("__m64", "__m64", ))
				    fn("_mm_add_pi8", "__m64", ("__m64", "__m64", ))
				    fn("_mm_add_pi16", "__m64", ("__m64", "__m64", ))
				    fn("_mm_add_pi32", "__m64", ("__m64", "__m64", ))
				    fn("_mm_adds_pi8", "__m64", ("__m64", "__m64", ))
				    fn("_mm_adds_pi16", "__m64", ("__m64", "__m64", ))
				    fn("_mm_adds_pu8", "__m64", ("__m64", "__m64", ))
				    fn("_mm_adds_pu16", "__m64", ("__m64", "__m64", ))
				    fn("_mm_sub_pi8", "__m64", ("__m64", "__m64", ))
				    fn("_mm_sub_pi16", "__m64", ("__m64", "__m64", ))
				    fn("_mm_sub_pi32", "__m64", ("__m64", "__m64", ))
				    fn("_mm_subs_pi8", "__m64", ("__m64", "__m64", ))
				    fn("_mm_subs_pi16", "__m64", ("__m64", "__m64", ))
				    fn("_mm_subs_pu8", "__m64", ("__m64", "__m64", ))
				    fn("_mm_subs_pu16", "__m64", ("__m64", "__m64", ))
				    fn("_mm_madd_pi16", "__m64", ("__m64", "__m64", ))
				    fn("_mm_mulhi_pi16", "__m64", ("__m64", "__m64", ))
				    fn("_mm_mullo_pi16", "__m64", ("__m64", "__m64", ))
				    fn("_mm_sll_pi16", "__m64", ("__m64", "__m64", ))
				    fn("_mm_slli_pi16", "__m64", ("__m64", "int", ))
				    fn("_mm_sll_pi32", "__m64", ("__m64", "__m64", ))
				    fn("_mm_slli_pi32", "__m64", ("__m64", "int", ))
				    fn("_mm_sll_si64", "__m64", ("__m64", "__m64", ))
				    fn("_mm_slli_si64", "__m64", ("__m64", "int", ))
				    fn("_mm_sra_pi16", "__m64", ("__m64", "__m64", ))
				    fn("_mm_srai_pi16", "__m64", ("__m64", "int", ))
				    fn("_mm_sra_pi32", "__m64", ("__m64", "__m64", ))
				    fn("_mm_srai_pi32", "__m64", ("__m64", "int", ))
				    fn("_mm_srl_pi16", "__m64", ("__m64", "__m64", ))
				    fn("_mm_srli_pi16", "__m64", ("__m64", "int", ))
				    fn("_mm_srl_pi32", "__m64", ("__m64", "__m64", ))
				    fn("_mm_srli_pi32", "__m64", ("__m64", "int", ))
				    fn("_mm_srl_si64", "__m64", ("__m64", "__m64", ))
				    fn("_mm_srli_si64", "__m64", ("__m64", "int", ))
				    fn("_mm_and_si64", "__m64", ("__m64", "__m64", ))
				    fn("_mm_andnot_si64", "__m64", ("__m64", "__m64", ))
				    fn("_mm_or_si64", "__m64", ("__m64", "__m64", ))
				    fn("_mm_xor_si64", "__m64", ("__m64", "__m64", ))
				    fn("_mm_cmpeq_pi8", "__m64", ("__m64", "__m64", ))
				    fn("_mm_cmpeq_pi16", "__m64", ("__m64", "__m64", ))
				    fn("_mm_cmpeq_pi32", "__m64", ("__m64", "__m64", ))
				    fn("_mm_cmpgt_pi8", "__m64", ("__m64", "__m64", ))
				    fn("_mm_cmpgt_pi16", "__m64", ("__m64", "__m64", ))
				    fn("_mm_cmpgt_pi32", "__m64", ("__m64", "__m64", ))
				    fn("_mm_setzero_si64", "__m64", ())
				    fn("_mm_set_pi32", "__m64", ("int", "int", ))
				    fn("_mm_set_pi16", "__m64", ("short", "short", "short", "short", ))
				    fn("_mm_set_pi8", "__m64", ("char", "char", "char", "char", "char", "char", "char", "char", ))
				    fn("_mm_set1_pi32", "__m64", ("int", ))
				    fn("_mm_set1_pi16", "__m64", ("short", ))
				    fn("_mm_set1_pi8", "__m64", ("char", ))
				    fn("_mm_setr_pi32", "__m64", ("int", "int", ))
				    fn("_mm_setr_pi16", "__m64", ("short", "short", "short", "short", ))
				    fn("_mm_setr_pi8", "__m64", ("char", "char", "char", "char", "char", "char", "char", "char", ))

				    # xmmintrin.h
				    fn("_mm_cvtps_pi32", "__m64", ("__m128", ))
				    fn("_mm_cvt_ps2pi", "__m64", ("__m128", ))
				    fn("_mm_cvttps_pi32", "__m64", ("__m128", ))
				    fn("_mm_cvtt_ps2pi", "__m64", ("__m128", ))
				    fn("_mm_cvtpi32_ps", "__m128", ("__m128", "__m64", ))
				    fn("_mm_cvt_pi2ps", "__m128", ("__m128", "__m64", ))
				    fn("_mm_loadh_pi", "__m128", ("__m128", "const __m64 *", ))
				    fn("_mm_loadl_pi", "__m128", ("__m128", "const __m64 *", ))
				    fn("_mm_storeh_pi", "void", ("__m64 *", "__m128", ))
				    fn("_mm_storel_pi", "void", ("__m64 *", "__m128", ))
				    fn("_mm_stream_pi", "void", ("__m64 *", "__m64", ))
				    fn("_mm_max_pi16", "__m64", ("__m64", "__m64", ))
				    fn("_mm_max_pu8", "__m64", ("__m64", "__m64", ))
				    fn("_mm_min_pi16", "__m64", ("__m64", "__m64", ))
				    fn("_mm_min_pu8", "__m64", ("__m64", "__m64", ))
				    fn("_mm_movemask_pi8", "int", ("__m64", ))
				    fn("_mm_mulhi_pu16", "__m64", ("__m64", "__m64", ))
				    fn("_mm_maskmove_si64", "void", ("__m64", "__m64", "char *", ))
				    fn("_mm_avg_pu8", "__m64", ("__m64", "__m64", ))
				    fn("_mm_avg_pu16", "__m64", ("__m64", "__m64", ))
				    fn("_mm_sad_pu8", "__m64", ("__m64", "__m64", ))
				    fn("_mm_cvtpi16_ps", "__m128", ("__m64", ))
				    fn("_mm_cvtpu16_ps", "__m128", ("__m64", ))
				    fn("_mm_cvtpi8_ps", "__m128", ("__m64", ))
				    fn("_mm_cvtpu8_ps", "__m128", ("__m64", ))
				    fn("_mm_cvtpi32x2_ps", "__m128", ("__m64", "__m64", ))
				    fn("_mm_cvtps_pi16", "__m64", ("__m128", ))
				    fn("_mm_cvtps_pi8", "__m64", ("__m128", ))

				    fn("_mm_extract_pi16", "int", ("__m64", "int", ), imm_range=(0, 3))
				    fn("_mm_insert_pi16", "__m64", ("__m64", "int", "int", ), imm_range=(0, 3))
				    fn("_mm_shuffle_pi16", "__m64", ("__m64", "int", ), imm_range=(0, 255))

				    # emmintrin.h
				    fn("_mm_cvtpd_pi32", "__m64", ("__m128d", ))
				    fn("_mm_cvttpd_pi32", "__m64", ("__m128d", ))
				    fn("_mm_cvtpi32_pd", "__m128d", ("__m64", ))
				    fn("_mm_add_si64", "__m64", ("__m64", "__m64", ))
				    fn("_mm_mul_su32", "__m64", ("__m64", "__m64", ))
				    fn("_mm_sub_si64", "__m64", ("__m64", "__m64", ))
				    fn("_mm_set_epi64", "__m128i", ("__m64", "__m64", ))
				    fn("_mm_set1_epi64", "__m128i", ("__m64", ))
				    fn("_mm_setr_epi64", "__m128i", ("__m64", "__m64", ))
				    fn("_mm_movepi64_pi64", "__m64", ("__m128i", ))
				    fn("_mm_movpi64_epi64", "__m128i", ("__m64", ))

				    # tmmintrin.h
				    fn("_mm_abs_pi8", "__m64", ("__m64", ), target='ssse3')
				    fn("_mm_abs_pi16", "__m64", ("__m64", ), target='ssse3')
				    fn("_mm_abs_pi32", "__m64", ("__m64", ), target='ssse3')
				    fn("_mm_hadd_pi16", "__m64", ("__m64", "__m64", ), target='ssse3')
				    fn("_mm_hadd_pi32", "__m64", ("__m64", "__m64", ), target='ssse3')
				    fn("_mm_hadds_pi16", "__m64", ("__m64", "__m64", ), target='ssse3')
				    fn("_mm_hsub_pi16", "__m64", ("__m64", "__m64", ), target='ssse3')
				    fn("_mm_hsub_pi32", "__m64", ("__m64", "__m64", ), target='ssse3')
				    fn("_mm_hsubs_pi16", "__m64", ("__m64", "__m64", ), target='ssse3')
				    fn("_mm_maddubs_pi16", "__m64", ("__m64", "__m64", ), target='ssse3')
				    fn("_mm_mulhrs_pi16", "__m64", ("__m64", "__m64", ), target='ssse3')
				    fn("_mm_shuffle_pi8", "__m64", ("__m64", "__m64", ), target='ssse3')
				    fn("_mm_sign_pi8", "__m64", ("__m64", "__m64", ), target='ssse3')
				    fn("_mm_sign_pi16", "__m64", ("__m64", "__m64", ), target='ssse3')
				    fn("_mm_sign_pi32", "__m64", ("__m64", "__m64", ), target='ssse3')
				    fn("_mm_alignr_pi8", "__m64", ("__m64", "__m64", "int", ), imm_range=(0, 18), target='ssse3')

				# Generate a file full of wrapper functions for each of the above mmx
				# functions.
				#
				# If use_xmm is set, pass/return arguments as __m128 rather than
				# __m64.
				def define_wrappers(prefix, use_xmm=True, header=False):
				    if header:
				        print('#pragma once')

				    print('#include <immintrin.h>')
				    if use_xmm and not header:
				        print('#define m128_to_m64(x) ((__m64)((__v2di)(x))[0])')
				        print('#define m64_to_m128(x) ((__m128)(__v2di){(long long)(__m64)(x), 0})')

				    def fn(name, ret_ty, arg_tys, imm_range=None, target=None, condition=None):
				        if condition:
				            print(f'#if {condition}')
				        convert_ret = False
				        if use_xmm and ret_ty == '__m64':
				            ret_ty = '__v2di'
				            convert_ret = True

				        if target:
				            attr = f'__attribute__((target("{target}"))) '
				        else:
				            attr = ''

				        if imm_range:
				            arg_tys = arg_tys[:-1]
				        def translate_type(t):
				            if use_xmm and t == '__m64':
				                return '__m128'
				            return t
				        def translate_arg(t, a):
				            if use_xmm and t == '__m64':
				                return f'm128_to_m64({a})'
				            return a

				        arg_decl = ', '.join(f'{translate_type(v[1])} arg_{v[0]}' for v in enumerate(arg_tys)) or 'void'
				        call_args = ', '.join(translate_arg(v[1], f'arg_{v[0]}') for v in enumerate(arg_tys))

				        def create_fn(suffix, extraarg):
				            if header:
				                print(f'{ret_ty} {prefix}_{name}{suffix}({arg_decl});')
				            else:
				                print(f'{attr}{ret_ty} {prefix}_{name}{suffix}({arg_decl})')
				                if use_xmm and convert_ret:
				                    print(f'{{ return ({ret_ty})m64_to_m128({name}({call_args}{extraarg})); }}')
				                else:
				                    print(f'{{ return {name}({call_args}{extraarg}); }}')

				        if imm_range:
				            for i in range(imm_range[0], imm_range[1]+1):
				                create_fn(f'_{i}', f', {i}')
				        else:
				            create_fn('', '')
				        if condition:
				            print('#endif')

				    do_mmx(fn)


				# Create a C file that tests an "orig" set of wrappers against a "new"
				# set of wrappers.
				def define_tests(use_xmm=False):
				    def fn(name, ret_ty, arg_tys, imm_range=None, target=None, condition=None):
				        if condition:
				            print(f'#if {condition}')
				        arg_decl = ', '.join(f'{v[1]} arg_{v[0]}' for v in enumerate(arg_tys)) or 'void'
				        print(f' // {ret_ty} {name}({arg_decl});')

				        if imm_range:
				            for i in range(imm_range[0], imm_range[1]+1):
				                fn(name + f'_{i}', ret_ty, arg_tys[:-1], target=target)
				            return

				        convert_pre = convert_post = ''
				        if use_xmm and ret_ty == '__m64':
				            convert_pre = 'm128_to_m64('
				            convert_post = ')'

				        args=[]
				        loops=[]
				        printf_fmts = []
				        printf_args = []
				        for arg_ty in arg_tys:
				            v=len(loops)
				            if arg_ty in ('char', 'short'):
				                loops.append(f' for(int l{v} = 0; l{v} < arraysize(short_vals); ++l{v}) {{')
				                args.append(f'({arg_ty})short_vals[l{v}]')
				                printf_fmts.append('%016x')
				                printf_args.append(f'short_vals[l{v}]')
				            elif arg_ty in ('int', 'long long'):
				                loops.append(f' for(int l{v} = 0; l{v} < arraysize(mmx_vals); ++l{v}) {{')
				                args.append(f'({arg_ty})mmx_vals[l{v}]')
				                printf_fmts.append('%016llx')
				                printf_args.append(f'mmx_vals[l{v}]')
				            elif arg_ty == '__m64':
				                loops.append(f' for(int l{v} = 0; l{v} < arraysize(mmx_vals); ++l{v}) {{')
				                if use_xmm:
				                    loops.append(f' for(int l{v+1} = 0; l{v+1} < arraysize(padding_mmx_vals); ++l{v+1}) {{')
				                    args.append(f'(__m128)(__m128i){{mmx_vals[l{v}], padding_mmx_vals[l{v+1}]}}')
				                    printf_fmts.append('(__m128i){%016llx, %016llx}')
				                    printf_args.append(f'mmx_vals[l{v}], padding_mmx_vals[l{v+1}]')
				                else:
				                    args.append(f'({arg_ty})mmx_vals[l{v}]')
				                    printf_fmts.append('%016llx')
				                    printf_args.append(f'mmx_vals[l{v}]')
				            elif arg_ty in ('__m128', '__m128i', '__m128d'):
				                loops.append(f' for(int l{v} = 0; l{v} < arraysize(mmx_vals); ++l{v}) {{')
				                loops.append(f' for(int l{v+1} = 0; l{v+1} < arraysize(mmx_vals); ++l{v+1}) {{')
				                args.append(f'({arg_ty})(__m128i){{mmx_vals[l{v}], mmx_vals[l{v+1}]}}')
				                printf_fmts.append('(__m128i){%016llx, %016llx}')
				                printf_args.append(f'mmx_vals[l{v}], mmx_vals[l{v+1}]')
				            elif arg_ty == 'const __m64 *':
				                loops.append(f' for(int l{v} = 0; l{v} < arraysize(mmx_vals); ++l{v}) {{\n' +
				                             f' mem.m64 = (__m64)mmx_vals[l{v}];')
				                args.append(f'&mem.m64')
				                printf_fmts.append('&mem.m64 /* %016llx */')
				                printf_args.append(f'(long long)mem.m64')
				            else:
				                print(' // -> UNSUPPORTED')
				                return

				        printf_fmt_str = '"' + ', '.join(printf_fmts) + '"'
				        if printf_args:
				            printf_arg_str = ', ' + ','.join(printf_args)
				        else:
				            printf_arg_str = ''

				        print('\n'.join(loops))
				        print(f'''
				 clear_exc_flags();
				 {ret_ty} orig_res = {convert_pre}orig_{name}({", ".join(args)}){convert_post};
				 int orig_exc = get_exc_flags();
				 clear_exc_flags();
				 {ret_ty} new_res = {convert_pre}new_{name}({", ".join(args)}){convert_post};
				 int new_exc = get_exc_flags();
				 check_mismatch("{name}", orig_exc, new_exc, &orig_res, &new_res, sizeof(orig_res), {printf_fmt_str}{printf_arg_str});
				''')
				        print(' }\n' * len(loops))
				        print()
				        if condition:
				            print('#endif')

				    do_mmx(fn)


				parser = argparse.ArgumentParser(description='Generate mmx test code.')
				parser.add_argument('--kind', choices=['wrapper', 'wrapper_h', 'test'])
				parser.add_argument('--wrapper-prefix', default='orig')
				parser.add_argument('--use-xmm', action='store_true')

				args = parser.parse_args()
				if args.kind == 'wrapper':
				    define_wrappers(args.wrapper_prefix, use_xmm=args.use_xmm, header=False)
				elif args.kind == 'wrapper_h':
				    define_wrappers(args.wrapper_prefix, use_xmm=args.use_xmm, header=True)
				elif args.kind == 'test':
				    define_tests(use_xmm=args.use_xmm)
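
To make the generator's output easier to picture, here is a minimal sketch of what define_wrappers() emits for a single intrinsic. The helper name render_wrapper and its imm parameter are hypothetical and not part of the patch; the sketch returns the C text instead of printing it, and it leaves out the target attribute and the #if condition handling.

```python
def render_wrapper(prefix, name, ret_ty, arg_tys, use_xmm=False, imm=None):
    """Return the C source of one wrapper, mirroring define_wrappers().

    Hypothetical helper for illustration only; not part of mmx-tests.py.
    """
    if imm is not None:
        # The immediate operand is baked into the wrapper, so it is
        # dropped from the wrapper's own parameter list.
        arg_tys = arg_tys[:-1]

    convert_ret = use_xmm and ret_ty == '__m64'
    out_ty = '__v2di' if convert_ret else ret_ty

    def c_type(t):
        # With use_xmm, __m64 parameters are passed in a 128-bit register.
        return '__m128' if use_xmm and t == '__m64' else t

    def c_arg(t, a):
        # ...and narrowed back to __m64 before calling the intrinsic.
        return f'm128_to_m64({a})' if use_xmm and t == '__m64' else a

    decl = ', '.join(f'{c_type(t)} arg_{i}' for i, t in enumerate(arg_tys)) or 'void'
    call = ', '.join(c_arg(t, f'arg_{i}') for i, t in enumerate(arg_tys))
    suffix = '' if imm is None else f'_{imm}'
    extra = '' if imm is None else f', {imm}'

    body = f'{name}({call}{extra})'
    if convert_ret:
        body = f'({out_ty})m64_to_m128({body})'
    return f'{out_ty} {prefix}_{name}{suffix}({decl})\n{{ return {body}; }}'


if __name__ == '__main__':
    print(render_wrapper('orig', '_mm_add_pi8', '__m64', ('__m64', '__m64')))
    print(render_wrapper('new', '_mm_extract_pi16', 'int', ('__m64', 'int'), imm=2))
```

The prefix is joined to the intrinsic name with an underscore, which is why test.c calls functions such as orig__mm_storeh_pi and new__mm_storeh_pi. An intrinsic with imm_range=(0, 3) yields four wrappers, suffixed _0 through _3.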

mmx-tests/test.c

This file was added.

				#include <fenv.h>
				#include <stdarg.h>
				#include <stdio.h>
				#include <stdlib.h>
				#include <string.h>
				#include <sys/mman.h>

				#include "gen_orig.h"
				#include "gen_new.h"


				// A bunch of helper functions for the code in gen_test.inc
				#define m128_to_m64(x) (__m64)((__v2di)(x))[0]

				#define arraysize(a) (sizeof(a) / sizeof(*a))

				static void dump_mem(void *ptr, int nbytes) {
				for (int i = 0; i < nbytes; ++i) {
				printf(" %02x", ((unsigned char*)ptr)[i]);
				}
				printf("\n");
				}

				static int get_exc_flags() {
				return fetestexcept(FE_ALL_EXCEPT | __FE_DENORM);
				}

				static void clear_exc_flags() {
				feclearexcept(FE_ALL_EXCEPT | __FE_DENORM);
				}

				static void dump_exc_flags(int exc_flags) {
				printf("%x", exc_flags);
				if (exc_flags & FE_INEXACT)
				printf(" inexact");
				if (exc_flags & FE_DIVBYZERO)
				printf(" divbyzero");
				if (exc_flags & FE_UNDERFLOW)
				printf(" underflow");
				if (exc_flags & FE_OVERFLOW)
				printf(" overflow");
				if (exc_flags & FE_INVALID)
				printf(" invalid");
				if (exc_flags & __FE_DENORM)
				printf(" denormal");
				}

				static void dump_result(int orig_exc, int new_exc, void *orig_data, void *new_data, int nbytes) {
				printf(" orig_exc = ");
				dump_exc_flags(orig_exc);
				printf(" new_exc = ");
				dump_exc_flags(new_exc);
				printf("\n");
				printf(" orig");
				dump_mem(orig_data, nbytes);
				printf(" new ");
				dump_mem(new_data, nbytes);
				}

				static void check_mismatch(const char *name, int orig_exc, int new_exc,
				void *orig_data, void *new_data, int nbytes,
				const char *printf_fmt, ...) {
				if (orig_exc != new_exc || memcmp(orig_data, new_data, nbytes)) {
				va_list args;
				va_start(args, printf_fmt);
				printf("mismatch %s(", name);
				vprintf(printf_fmt, args);
				printf("):\n");
				dump_result(orig_exc, new_exc, orig_data, new_data, nbytes);
				va_end(args);
				}
				}

				unsigned short short_vals[] = {
				0x0000,
				0x0001,
				0xffee,
				0xffff,
				};

				unsigned long long padding_mmx_vals[] = {
				0x0000000000000000LL,
				0xffffffffffffffffLL,
				0x7fc000007fc00000LL, // float nan nan
				0xfff8000000000000LL, // -nan
				};

				unsigned long long mmx_vals[] = {
				0x0000000000000000LL,
				0x0000000000000001LL,
				0x0000000000000002LL,
				0x0000000000000003LL,
				0x0000000000000004LL,
				0x0000000000000005LL,
				0x0000000000000006LL,
				0x0000000000000007LL,
				0x0000000000000008LL,
				0x0000000000000009LL,
				0x000000000000000aLL,
				0x000000000000000bLL,
				0x000000000000000cLL,
				0x000000000000000dLL,
				0x000000000000000eLL,
				0x000000000000000fLL,
				0x0000000000000100LL,
				0x0000000000010000LL,
				0x0000000001000000LL,
				0x0000000100000000LL,
				0x0000010000000000LL,
				0x0001000000000000LL,
				0x0100000000000000LL,
				0x0101010101010101LL,
				0x0102030405060708LL,
				0x1234567890abcdefLL,
				0x007f007f007f007fLL,
				0x7f007f007f007f00LL,
				0x7f7f7f7f7f7f7f7fLL,
				0x8000800080008000LL,
				0x0080008000800080LL,
				0x8080808080808080LL,
				0x7fff7fff7fff7fffLL,
				0x8000800080008000LL,
				0x7fffffff7fffffffLL,
				0x8000000080000000LL,
				0x0000777700006666LL,
				0x7777000066660000LL,
				0x0000ffff0000eeeeLL,
				0xffff0000eeee0000LL,
				0x7700660055004400LL,
				0x0077006600550044LL,
				0xff00ee00dd00cc00LL,
				0x00ff00ee00dd00ccLL,
				0xffffffffffffffffLL,
				0x3ff0000000000000LL, // 1.0
				0x3ff8000000000000LL, // 1.5
				0x4000000000000000LL, // 2.0
				0x3f8000003fc00000LL, // float 1.0 1.5
				0x3fc0000040000000LL, // float 1.5 2.0
				0x7ff0000000000000LL, // inf
				0x7f8000007f800000LL, // float inf inf
				0xfff0000000000000LL, // -inf
				0xff800000ff800000LL, // float -inf -inf
				0x7ff8000000000000LL, // nan
				0x7fc000007fc00000LL, // float nan nan
				0xfff8000000000000LL, // -nan
				0xffc00000ffc00000LL, // float -nan -nan
				};

				struct __attribute__((aligned(sizeof(__m128)))) Mem {
				__m64 dummy;
				__m64 m64;
				} mem, mem2;

				// These 3 could be autogenerated...but I didn't add support for stores to the generator.
				void test_stores() {
				// void _mm_storeh_pi(__m64 * arg_0, __m128 arg_1);
				for(int l0 = 0; l0 < arraysize(mmx_vals); ++l0) {
				for(int l1 = 0; l1 < arraysize(mmx_vals); ++l1) {
				clear_exc_flags();
				orig__mm_storeh_pi(&mem.m64, (__m128)(__m128i){mmx_vals[l0], mmx_vals[l1]});
				int orig_exc = get_exc_flags();
				clear_exc_flags();
				new__mm_storeh_pi(&mem2.m64, (__m128)(__m128i){mmx_vals[l0], mmx_vals[l1]});
				int new_exc = get_exc_flags();
				check_mismatch("_mm_storeh_pi", orig_exc, new_exc, &mem.m64, &mem2.m64, sizeof(__m64),
				"&mem.m64, (__m128i){%016llx, %016llx},", mmx_vals[l0], mmx_vals[l1]);
				}
				}

				// void _mm_storel_pi(__m64 * arg_0, __m128 arg_1);
				for(int l0 = 0; l0 < arraysize(mmx_vals); ++l0) {
				for(int l1 = 0; l1 < arraysize(mmx_vals); ++l1) {
				clear_exc_flags();
				orig__mm_storel_pi(&mem.m64, (__m128)(__m128i){mmx_vals[l0], mmx_vals[l1]});
				int orig_exc = get_exc_flags();
				clear_exc_flags();
				new__mm_storel_pi(&mem2.m64, (__m128)(__m128i){mmx_vals[l0], mmx_vals[l1]});
				int new_exc = get_exc_flags();
				check_mismatch("_mm_storel_pi", orig_exc, new_exc, &mem.m64, &mem2.m64, sizeof(__m64),
				"&mem.m64, (__m128i){%016llx, %016llx},", mmx_vals[l0], mmx_vals[l1]);
				}
				}

				// void _mm_stream_pi(__m64 * arg_0, __m64 arg_1);
				for(int l0 = 0; l0 < arraysize(mmx_vals); ++l0) {
				clear_exc_flags();
				orig__mm_stream_pi(&mem.m64, (__m64)mmx_vals[l0]);
				int orig_exc = get_exc_flags();
				clear_exc_flags();
				new__mm_stream_pi(&mem2.m64, (__m64)mmx_vals[l0]);
				int new_exc = get_exc_flags();
				check_mismatch("_mm_stream_pi", orig_exc, new_exc, &mem.m64, &mem2.m64, sizeof(__m64),
				"&mem.m64, %016llx,", mmx_vals[l0]);
				}
				}

				// Test that the nominally 64-bit maskmove doesn't trap at the edges of
				// non-writable memory, despite being implemented by a 128-bit write.
				void test_maskmove() {
				// Create a page of memory with an inaccessible page on either side.
				char *map = mmap(0, 3 * 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
				if (map == MAP_FAILED)
				abort();
				if (mprotect(map, 4096, PROT_NONE))
				abort();
				if (mprotect(map + 4096 * 2, 4096, PROT_NONE))
				abort();
				long long init_val = 0xffeeddccbbaa9900;
				long long expected = 0x11ee3344bb669900;
				for (int offset = 0; offset < 16+9; ++offset) {
				char *copy_location = map + 4096 + (offset > 16 ? 4096 - 32 + offset : offset);
				memcpy(copy_location, &init_val, 8);
				new__mm_maskmove_si64((__m64)0x1122334455667788LL, (__m64)0x8000808000800000, copy_location);
				long long result;
				memcpy(&result, copy_location, 8);
				if (memcmp(&expected, &result, 8) != 0) {
				printf("test_maskmove: wrong value was stored %llx vs %llx\n", result, expected);
				return;
				}
				}
				}

				void test_generated() {
				#include "gen_test.inc"
				}

				int main() {
				int rounding[] = {FE_TONEAREST, FE_UPWARD, FE_DOWNWARD, FE_TOWARDZERO};
				for (int i = 0; i < 4; ++i)
				{
				fesetround(rounding[i]);

				test_maskmove();
				test_stores();
				test_generated();
				}
				}
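
The constants in test_maskmove() can be checked by hand with a small model. The sketch below assumes the usual byte-masked store semantics of the instruction behind _mm_maskmove_si64 (a data byte is written only where the most significant bit of the corresponding mask byte is set). The function names maskmove_model and copy_offset are hypothetical and only mirror the arithmetic in the test.

```python
def maskmove_model(data, mask, memory):
    """Byte-wise model of a 64-bit masked store.

    For each of the 8 bytes, take the byte from `data` when bit 7 of the
    matching `mask` byte is set; otherwise keep the byte already in `memory`.
    """
    result = 0
    for i in range(8):
        shift = 8 * i
        mask_byte = (mask >> shift) & 0xff
        src = data if mask_byte & 0x80 else memory
        result |= ((src >> shift) & 0xff) << shift
    return result


def copy_offset(offset, page=4096):
    """Offset inside the writable page, as computed in test_maskmove().

    Offsets 0..16 probe the start of the page; offsets 17..24 are moved so
    that the 8-byte store ends at or just before the end of the page.
    """
    return page - 32 + offset if offset > 16 else offset


if __name__ == '__main__':
    stored = maskmove_model(0x1122334455667788, 0x8000808000800000,
                            0xffeeddccbbaa9900)
    print(hex(stored))
    print(copy_offset(0), copy_offset(24) + 8)
```

With the values used in test.c, the model reproduces the expected constant 0x11ee3344bb669900, and the last offset places the final stored byte directly before the PROT_NONE page.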

Revision Contents

Diff 315039
