This is an archive of the discontinued LLVM Phabricator instance.

Patched clang to emit x86 blends as shufflevectors.
ClosedPublic

Authored by filcab on May 2 2014, 5:55 PM.

Details

Reviewers

eli.friedman
craig.topper
rafael

Commits

rG5d289b48b11f: Patched clang to emit x86 blends as shufflevectors.
rC208664: Patched clang to emit x86 blends as shufflevectors.
rL208664: Patched clang to emit x86 blends as shufflevectors.

Summary

Most of the clang header patch by Simon Pilgrim @ SCEE.
Also fixed (or added) clang tests for these intrinsics.

LLVM tests to make sure we get the blend instruction out of these
shufflevectors are at http://reviews.llvm.org/D3600
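The idea behind the patch can be modeled in portable C. An immediate blend takes lane i from the second operand when bit i of the mask is set, which is the same as a two-source shuffle whose index for lane i is either i or n + i. The sketch below is a scalar model only, not the header code; `shuffle2` and `blend_indices` are hypothetical helper names introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scalar model of a two-source shuffle: indices 0..n-1 select
   from a, indices n..2n-1 select from b (the shufflevector convention). */
static void shuffle2(const int *a, const int *b, const int *idx,
                     size_t n, int *out)
{
    for (size_t i = 0; i < n; ++i) {
        assert(idx[i] >= 0 && (size_t)idx[i] < 2 * n);
        out[i] = ((size_t)idx[i] < n) ? a[idx[i]] : b[(size_t)idx[i] - n];
    }
}

/* Build the shuffle indices for an n-lane immediate blend (n <= 8 in this
   model): bit i set -> lane i comes from b (index n + i), else from a. */
static void blend_indices(unsigned mask, size_t n, int *idx)
{
    assert(n <= 8);
    for (size_t i = 0; i < n; ++i)
        idx[i] = ((mask >> i) & 1u) ? (int)(n + i) : (int)i;
}
```

With n = 4 and mask 5, `blend_indices` yields 4, 1, 6, 3, which matches the CHECK line for `_mm_blend_ps(V1, V2, 5)` in this revision.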

Diff Detail

Event Timeline

filcab updated this revision to Diff 9053. May 2 2014, 5:55 PM

filcab retitled this revision to Patched clang to emit x86 blends as shufflevectors.

filcab updated this object.

filcab edited the test plan for this revision.

filcab added reviewers: eli.friedman, craig.topper.

filcab added a subscriber: Unknown Object (MLST).

Should code that is directly using the builtins themselves (like
__builtin_ia32_pblendw256) be optimized too? If so, wouldn't it be
better to, for example, leave _mm256_blend_epi16 as is, remove
__builtin_ia32_pblendw256 from BuiltinsX86.def and make it a #define
to shufflevector?

Ah, I hadn't thought of that.
But it seems that the gcc manual explicitly says they're functions and that
they're available:
http://gcc.gnu.org/onlinedocs/gcc-4.9.0/gcc/X86-Built-in-Functions.html#X86-Built-in-Functions
I'm not sure, but I suppose we should keep them if they're documented as
available, right? Or are we not maintaining compatibility here (our manual
doesn't mention these builtins, AFAICT)?

I also don't see any other intrinsic doing the same (which doesn't mean we
can't start now, obviously).

Filipe

Filipe points out that we cannot use a #define for the __builtin since it has to be available even when no .h is included. Optimizing the builtin would then require custom code in clang/llvm, which is probably not worth it.

lib/Headers/avx2intrin.h
163–179	Why the change to __m256d? The Intel manual says the signature is __m256i _mm256_blend_epi16 (__m256i v1, __m256i v2, const int mask)
168	The masks look wrong for the high bits. Shouldn't this be
(((M) & 0x01) ? 16 : 0), \
(((M) & 0x02) ? 17 : 1), \
(((M) & 0x04) ? 18 : 2), \
(((M) & 0x08) ? 19 : 3), \
(((M) & 0x10) ? 20 : 4), \
(((M) & 0x20) ? 21 : 5), \
(((M) & 0x40) ? 22 : 6), \
(((M) & 0x80) ? 23 : 7), \
(((M) & 0x01) ? 24 : 8), \
(((M) & 0x02) ? 25 : 9), \
(((M) & 0x04) ? 26 : 10), \
(((M) & 0x08) ? 27 : 11), \
(((M) & 0x10) ? 28 : 12), \
(((M) & 0x20) ? 29 : 13), \
(((M) & 0x40) ? 30 : 14), \
(((M) & 0x80) ? 31 : 15), \
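The index list in this comment follows from the 8-bit immediate being reused for the upper eight lanes: lanes 8 to 15 test mask bits 0 to 7 again. A small scalar sketch of that index computation, assuming this reading of the comment (the function name is hypothetical and not part of the patch):

```c
#include <assert.h>

/* Hypothetical helper: shuffle indices for a 16-lane word blend whose 8-bit
   immediate is reused for the upper eight lanes. */
static void blend_epi16_256_indices(unsigned mask, int idx[16])
{
    assert(mask <= 0xFFu);  /* this model expects an 8-bit immediate */
    for (int i = 0; i < 16; ++i) {
        unsigned bit = (mask >> (i & 7)) & 1u;  /* lanes 8..15 wrap to bits 0..7 */
        idx[i] = bit ? 16 + i : i;
    }
}
```

With mask 2, lanes 1 and 9 become 17 and 25 and every other lane keeps its own index, which agrees with the `_mm256_blend_epi16(a, b, 2)` CHECK line in test/CodeGen/avx2-builtins.c.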

Fixed masks in the blend shufflemasks.

Fixed a typo pointed out by Rafael.

LGTM with a few small requests.

test/CodeGen/avx-builtins.c
118	Write the constant in hex, so it is easier to read. 57 is 0x39, so the lower 4 bits are 1001. It would probably be best to use a non-symmetrical constant in the test.
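One way to read the request for a non-symmetrical constant: the low four bits of 0x39 are 1001, which reads the same in either direction, so a header that walked the mask bits in the wrong order would still pass a 4-lane test. The sketch below models that check in plain C; both helper names are hypothetical, and the interpretation of "symmetrical" is an assumption.

```c
#include <assert.h>
#include <stdbool.h>

/* Reverse the low `lanes` bits of `mask` (lanes <= 8 in this model). */
static unsigned reverse_low_bits(unsigned mask, int lanes)
{
    assert(lanes > 0 && lanes <= 8);
    unsigned r = 0;
    for (int i = 0; i < lanes; ++i)
        if ((mask >> i) & 1u)
            r |= 1u << (lanes - 1 - i);
    return r;
}

/* True if a bit-order mistake would change the blend result for this mask,
   i.e. the mask is a useful test constant for catching that mistake. */
static bool detects_bit_order_bug(unsigned mask, int lanes)
{
    assert(lanes > 0 && lanes <= 8);
    unsigned low = mask & ((1u << lanes) - 1u);
    return reverse_low_bits(low, lanes) != low;
}
```

Under this model, 0x39 is uninformative for the 4-lane `_mm256_blend_pd` test but informative for the 8-lane `_mm256_blend_ps` test, while a constant such as 0x5 would be informative for 4 lanes.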

This revision is now accepted and ready to land. May 7 2014, 3:05 PM

I just realized that one advantage of expanding the __builtins in clang is that it would allow us to remove the redundant ones from llvm in the future. Can you leave a FIXME about it in one of the tests?

filcab closed this revision. May 12 2014, 7:44 PM

Revision Contents

Path	Size
lib/Headers/avx2intrin.h	34 lines
lib/Headers/avxintrin.h	16 lines
lib/Headers/smmintrin.h	20 lines
test/CodeGen/avx-builtins.c	12 lines
test/CodeGen/avx2-builtins.c	7 lines
test/CodeGen/sse-builtins.c	18 lines

Diff 9190

lib/Headers/avx2intrin.h

	[... 154 lines not shown ...]
	{			{
	return (__m256i)__builtin_ia32_pblendvb256((__v32qi)__V1, (__v32qi)__V2,			return (__m256i)__builtin_ia32_pblendvb256((__v32qi)__V1, (__v32qi)__V2,
	(__v32qi)__M);			(__v32qi)__M);
	}			}

	#define _mm256_blend_epi16(V1, V2, M) __extension__ ({ \			#define _mm256_blend_epi16(V1, V2, M) __extension__ ({ \
	__m256i __V1 = (V1); \			__m256i __V1 = (V1); \
	__m256i __V2 = (V2); \			__m256i __V2 = (V2); \
	(__m256i)__builtin_ia32_pblendw256((__v16hi)__V1, (__v16hi)__V2, (M)); })			(__m256d)__builtin_shufflevector((__v16hi)__V1, (__v16hi)__V2, \
				(((M) & 0x01) ? 16 : 0), \
				(((M) & 0x02) ? 17 : 1), \
				(((M) & 0x04) ? 18 : 2), \
				(((M) & 0x08) ? 19 : 3), \
				(((M) & 0x10) ? 20 : 4), \
				(((M) & 0x20) ? 21 : 5), \
				(((M) & 0x40) ? 22 : 6), \
				(((M) & 0x80) ? 23 : 7), \
				(((M) & 0x01) ? 24 : 8), \
				(((M) & 0x02) ? 25 : 9), \
				(((M) & 0x04) ? 26 : 10), \
				(((M) & 0x08) ? 27 : 11), \
				(((M) & 0x10) ? 28 : 12), \
				(((M) & 0x20) ? 29 : 13), \
				(((M) & 0x40) ? 30 : 14), \
				(((M) & 0x80) ? 31 : 15)); })

	static __inline__ __m256i __attribute__((__always_inline__, __nodebug__))			static __inline__ __m256i __attribute__((__always_inline__, __nodebug__))
	_mm256_cmpeq_epi8(__m256i __a, __m256i __b)			_mm256_cmpeq_epi8(__m256i __a, __m256i __b)
	{			{
	return (__m256i)((__v32qi)__a == (__v32qi)__b);			return (__m256i)((__v32qi)__a == (__v32qi)__b);
	}			}

	static __inline__ __m256i __attribute__((__always_inline__, __nodebug__))			static __inline__ __m256i __attribute__((__always_inline__, __nodebug__))
	[... 584 lines not shown ...]
	_mm256_broadcastsi128_si256(__m128i __X)			_mm256_broadcastsi128_si256(__m128i __X)
	{			{
	return (__m256i)__builtin_ia32_vbroadcastsi256(__X);			return (__m256i)__builtin_ia32_vbroadcastsi256(__X);
	}			}

	#define _mm_blend_epi32(V1, V2, M) __extension__ ({ \			#define _mm_blend_epi32(V1, V2, M) __extension__ ({ \
	__m128i __V1 = (V1); \			__m128i __V1 = (V1); \
	__m128i __V2 = (V2); \			__m128i __V2 = (V2); \
	(__m128i)__builtin_ia32_pblendd128((__v4si)__V1, (__v4si)__V2, (M)); })			(__m128i)__builtin_shufflevector((__v4si)__V1, (__v4si)__V2, \
				(((M) & 0x01) ? 4 : 0), \
				(((M) & 0x02) ? 5 : 1), \
				(((M) & 0x04) ? 6 : 2), \
				(((M) & 0x08) ? 7 : 3)); })

	#define _mm256_blend_epi32(V1, V2, M) __extension__ ({ \			#define _mm256_blend_epi32(V1, V2, M) __extension__ ({ \
	__m256i __V1 = (V1); \			__m256i __V1 = (V1); \
	__m256i __V2 = (V2); \			__m256i __V2 = (V2); \
	(__m256i)__builtin_ia32_pblendd256((__v8si)__V1, (__v8si)__V2, (M)); })			(__m256i)__builtin_shufflevector((__v8si)__V1, (__v8si)__V2, \
				(((M) & 0x01) ? 8 : 0), \
				(((M) & 0x02) ? 9 : 1), \
				(((M) & 0x04) ? 10 : 2), \
				(((M) & 0x08) ? 11 : 3), \
				(((M) & 0x10) ? 12 : 4), \
				(((M) & 0x20) ? 13 : 5), \
				(((M) & 0x40) ? 14 : 6), \
				(((M) & 0x80) ? 15 : 7)); })

	static __inline__ __m256i __attribute__((__always_inline__, __nodebug__))			static __inline__ __m256i __attribute__((__always_inline__, __nodebug__))
	_mm256_broadcastb_epi8(__m128i __X)			_mm256_broadcastb_epi8(__m128i __X)
	{			{
	return (__m256i)__builtin_ia32_pbroadcastb256((__v16qi)__X);			return (__m256i)__builtin_ia32_pbroadcastb256((__v16qi)__X);
	}			}

	static __inline__ __m256i __attribute__((__always_inline__, __nodebug__))			static __inline__ __m256i __attribute__((__always_inline__, __nodebug__))
	[... 429 lines not shown ...]

lib/Headers/avxintrin.h

[... 302 lines not shown ...]	#define _mm256_permute2f128_si256(V1, V2, M) __extension__ ({ \
__m256i __V1 = (V1); \		__m256i __V1 = (V1); \
__m256i __V2 = (V2); \		__m256i __V2 = (V2); \
(__m256i)__builtin_ia32_vperm2f128_si256((__v8si)__V1, (__v8si)__V2, (M)); })		(__m256i)__builtin_ia32_vperm2f128_si256((__v8si)__V1, (__v8si)__V2, (M)); })

/* Vector Blend */		/* Vector Blend */
#define _mm256_blend_pd(V1, V2, M) __extension__ ({ \		#define _mm256_blend_pd(V1, V2, M) __extension__ ({ \
__m256d __V1 = (V1); \		__m256d __V1 = (V1); \
__m256d __V2 = (V2); \		__m256d __V2 = (V2); \
(__m256d)__builtin_ia32_blendpd256((__v4df)__V1, (__v4df)__V2, (M)); })		(__m256d)__builtin_shufflevector((__v4df)__V1, (__v4df)__V2, \
		(((M) & 0x01) ? 4 : 0), \
		(((M) & 0x02) ? 5 : 1), \
		(((M) & 0x04) ? 6 : 2), \
		(((M) & 0x08) ? 7 : 3)); })

#define _mm256_blend_ps(V1, V2, M) __extension__ ({ \		#define _mm256_blend_ps(V1, V2, M) __extension__ ({ \
__m256 __V1 = (V1); \		__m256 __V1 = (V1); \
__m256 __V2 = (V2); \		__m256 __V2 = (V2); \
(__m256)__builtin_ia32_blendps256((__v8sf)__V1, (__v8sf)__V2, (M)); })		(__m256)__builtin_shufflevector((__v8sf)__V1, (__v8sf)__V2, \
		(((M) & 0x01) ? 8 : 0), \
		(((M) & 0x02) ? 9 : 1), \
		(((M) & 0x04) ? 10 : 2), \
		(((M) & 0x08) ? 11 : 3), \
		(((M) & 0x10) ? 12 : 4), \
		(((M) & 0x20) ? 13 : 5), \
		(((M) & 0x40) ? 14 : 6), \
		(((M) & 0x80) ? 15 : 7)); })

static __inline __m256d __attribute__((__always_inline__, __nodebug__))		static __inline __m256d __attribute__((__always_inline__, __nodebug__))
_mm256_blendv_pd(__m256d __a, __m256d __b, __m256d __c)		_mm256_blendv_pd(__m256d __a, __m256d __b, __m256d __c)
{		{
return (__m256d)__builtin_ia32_blendvpd256(		return (__m256d)__builtin_ia32_blendvpd256(
(__v4df)__a, (__v4df)__b, (__v4df)__c);		(__v4df)__a, (__v4df)__b, (__v4df)__c);
}		}

[... 900 lines not shown ...]

lib/Headers/smmintrin.h

[... 73 lines not shown ...]	#define _mm_round_sd(X, Y, M) __extension__ ({ \
__m128d __X = (X); \		__m128d __X = (X); \
__m128d __Y = (Y); \		__m128d __Y = (Y); \
(__m128d) __builtin_ia32_roundsd((__v2df)__X, (__v2df)__Y, (M)); })		(__m128d) __builtin_ia32_roundsd((__v2df)__X, (__v2df)__Y, (M)); })

/* SSE4 Packed Blending Intrinsics. */		/* SSE4 Packed Blending Intrinsics. */
#define _mm_blend_pd(V1, V2, M) __extension__ ({ \		#define _mm_blend_pd(V1, V2, M) __extension__ ({ \
__m128d __V1 = (V1); \		__m128d __V1 = (V1); \
__m128d __V2 = (V2); \		__m128d __V2 = (V2); \
(__m128d) __builtin_ia32_blendpd ((__v2df)__V1, (__v2df)__V2, (M)); })		(__m128d)__builtin_shufflevector((__v2df)__V1, (__v2df)__V2, \
		(((M) & 0x01) ? 2 : 0), \
		(((M) & 0x02) ? 3 : 1)); })

#define _mm_blend_ps(V1, V2, M) __extension__ ({ \		#define _mm_blend_ps(V1, V2, M) __extension__ ({ \
__m128 __V1 = (V1); \		__m128 __V1 = (V1); \
__m128 __V2 = (V2); \		__m128 __V2 = (V2); \
(__m128) __builtin_ia32_blendps ((__v4sf)__V1, (__v4sf)__V2, (M)); })		(__m128)__builtin_shufflevector((__v4sf)__V1, (__v4sf)__V2, \
		(((M) & 0x01) ? 4 : 0), \
		(((M) & 0x02) ? 5 : 1), \
		(((M) & 0x04) ? 6 : 2), \
		(((M) & 0x08) ? 7 : 3)); })

static __inline__ __m128d __attribute__((__always_inline__, __nodebug__))		static __inline__ __m128d __attribute__((__always_inline__, __nodebug__))
_mm_blendv_pd (__m128d __V1, __m128d __V2, __m128d __M)		_mm_blendv_pd (__m128d __V1, __m128d __V2, __m128d __M)
{		{
return (__m128d) __builtin_ia32_blendvpd ((__v2df)__V1, (__v2df)__V2,		return (__m128d) __builtin_ia32_blendvpd ((__v2df)__V1, (__v2df)__V2,
(__v2df)__M);		(__v2df)__M);
}		}

[... 9 lines not shown ...]
{		{
return (__m128i) __builtin_ia32_pblendvb128 ((__v16qi)__V1, (__v16qi)__V2,		return (__m128i) __builtin_ia32_pblendvb128 ((__v16qi)__V1, (__v16qi)__V2,
(__v16qi)__M);		(__v16qi)__M);
}		}

#define _mm_blend_epi16(V1, V2, M) __extension__ ({ \		#define _mm_blend_epi16(V1, V2, M) __extension__ ({ \
__m128i __V1 = (V1); \		__m128i __V1 = (V1); \
__m128i __V2 = (V2); \		__m128i __V2 = (V2); \
(__m128i) __builtin_ia32_pblendw128 ((__v8hi)__V1, (__v8hi)__V2, (M)); })		(__m128i)__builtin_shufflevector((__v8hi)__V1, (__v8hi)__V2, \
		(((M) & 0x01) ? 8 : 0), \
		(((M) & 0x02) ? 9 : 1), \
		(((M) & 0x04) ? 10 : 2), \
		(((M) & 0x08) ? 11 : 3), \
		(((M) & 0x10) ? 12 : 4), \
		(((M) & 0x20) ? 13 : 5), \
		(((M) & 0x40) ? 14 : 6), \
		(((M) & 0x80) ? 15 : 7)); })

/* SSE4 Dword Multiply Instructions. */		/* SSE4 Dword Multiply Instructions. */
static __inline__ __m128i __attribute__((__always_inline__, __nodebug__))		static __inline__ __m128i __attribute__((__always_inline__, __nodebug__))
_mm_mullo_epi32 (__m128i __V1, __m128i __V2)		_mm_mullo_epi32 (__m128i __V1, __m128i __V2)
{		{
return (__m128i) ((__v4si)__V1 * (__v4si)__V2);		return (__m128i) ((__v4si)__V1 * (__v4si)__V2);
}		}

[... 347 lines not shown ...]

test/CodeGen/avx-builtins.c

[... 105 lines not shown ...]	int test_extract_epi16(__m256i __a) {
return _mm256_extract_epi16(__a, 16);		return _mm256_extract_epi16(__a, 16);
}		}

int test_extract_epi8(__m256i __a) {		int test_extract_epi8(__m256i __a) {
// CHECK-LABEL: @test_extract_epi8		// CHECK-LABEL: @test_extract_epi8
// CHECK: extractelement <32 x i8> %{{.*}}, i32 0		// CHECK: extractelement <32 x i8> %{{.*}}, i32 0
return _mm256_extract_epi8(__a, 32);		return _mm256_extract_epi8(__a, 32);
}		}

		__m256d test_256_blend_pd(__m256d __a, __m256d __b) {
		// CHECK-LABEL: @test_256_blend_pd
		// CHECK: shufflevector <4 x double> %{{.*}}, <4 x double> %{{.*}}, <4 x i32> <i32 4, i32 1, i32 2, i32 7>
		return _mm256_blend_pd(__a, __b, 57);
		}

		__m256 test_256_blend_ps(__m256 __a, __m256 __b) {
		// CHECK-LABEL: @test_256_blend_ps
		// CHECK: shufflevector <8 x float> %{{.*}}, <8 x float> %{{.*}}, <8 x i32> <i32 8, i32 1, i32 2, i32 11, i32 12, i32 13, i32 6, i32 7>
		return _mm256_blend_ps(__a, __b, 57);
		}

test/CodeGen/avx2-builtins.c

	[... 171 lines not shown ...]
	}			}

	__m256i test_mm256_blendv_epi8(__m256i a, __m256i b, __m256i m) {			__m256i test_mm256_blendv_epi8(__m256i a, __m256i b, __m256i m) {
	// CHECK: @llvm.x86.avx2.pblendvb			// CHECK: @llvm.x86.avx2.pblendvb
	return _mm256_blendv_epi8(a, b, m);			return _mm256_blendv_epi8(a, b, m);
	}			}

	__m256i test_mm256_blend_epi16(__m256i a, __m256i b) {			__m256i test_mm256_blend_epi16(__m256i a, __m256i b) {
	// CHECK: @llvm.x86.avx2.pblendw(<16 x i16> %{{.*}}, <16 x i16> %{{.*}}, i32 2)			// CHECK-LABEL: test_mm256_blend_epi16
				// CHECK: shufflevector <16 x i16> %{{.*}}, <16 x i16> %{{.*}}, <16 x i32> <i32 0, i32 17, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 25, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
	return _mm256_blend_epi16(a, b, 2);			return _mm256_blend_epi16(a, b, 2);
	}			}

	__m256i test_mm256_cmpeq_epi8(__m256i a, __m256i b) {			__m256i test_mm256_cmpeq_epi8(__m256i a, __m256i b) {
	// CHECK: icmp eq <32 x i8>			// CHECK: icmp eq <32 x i8>
	return _mm256_cmpeq_epi8(a, b);			return _mm256_cmpeq_epi8(a, b);
	}			}

	[... 418 lines not shown ...]
	}			}

	__m256i test_mm256_broadcastsi128_si256(__m128i a) {			__m256i test_mm256_broadcastsi128_si256(__m128i a) {
	// CHECK: @llvm.x86.avx2.vbroadcasti128			// CHECK: @llvm.x86.avx2.vbroadcasti128
	return _mm256_broadcastsi128_si256(a);			return _mm256_broadcastsi128_si256(a);
	}			}

	__m128i test_mm_blend_epi32(__m128i a, __m128i b) {			__m128i test_mm_blend_epi32(__m128i a, __m128i b) {
	// CHECK: @llvm.x86.avx2.pblendd.128			// CHECK: shufflevector <4 x i32> %{{.*}}, <4 x i32> %{{.*}}, <4 x i32> <i32 4, i32 1, i32 2, i32 7>
	return _mm_blend_epi32(a, b, 57);			return _mm_blend_epi32(a, b, 57);
	}			}

	__m256i test_mm256_blend_epi32(__m256i a, __m256i b) {			__m256i test_mm256_blend_epi32(__m256i a, __m256i b) {
	// CHECK: @llvm.x86.avx2.pblendd.256			// CHECK: shufflevector <8 x i32> %{{.*}}, <8 x i32> %{{.*}}, <8 x i32> <i32 8, i32 1, i32 2, i32 11, i32 12, i32 13, i32 6, i32 7>
	return _mm256_blend_epi32(a, b, 57);			return _mm256_blend_epi32(a, b, 57);
	}			}

	__m256i test_mm256_broadcastb_epi8(__m128i a) {			__m256i test_mm256_broadcastb_epi8(__m128i a) {
	// CHECK: @llvm.x86.avx2.pbroadcastb.256			// CHECK: @llvm.x86.avx2.pbroadcastb.256
	return _mm256_broadcastb_epi8(a);			return _mm256_broadcastb_epi8(a);
	}			}

	[... 310 lines not shown ...]

test/CodeGen/sse-builtins.c

[... 231 lines not shown ...]	int test_extract_epi32(__m128i __a) {
return _mm_extract_epi32(__a, 4);		return _mm_extract_epi32(__a, 4);
}		}

void test_insert_epi32(__m128i __a, int b) {		void test_insert_epi32(__m128i __a, int b) {
// CHECK-LABEL: @test_insert_epi32		// CHECK-LABEL: @test_insert_epi32
// CHECK: insertelement <4 x i32> %{{.*}}, i32 %{{.*}}, i32 0		// CHECK: insertelement <4 x i32> %{{.*}}, i32 %{{.*}}, i32 0
_mm_insert_epi32(__a, b, 4);		_mm_insert_epi32(__a, b, 4);
}		}

		__m128d test_blend_pd(__m128d V1, __m128d V2) {
		// CHECK-LABEL: @test_blend_pd
		// CHECK: shufflevector <2 x double> %{{.*}}, <2 x double> %{{.*}}, <2 x i32> <i32 2, i32 1>
		return _mm_blend_pd(V1, V2, 1);
		}

		__m128 test_blend_ps(__m128 V1, __m128 V2) {
		// CHECK-LABEL: @test_blend_ps
		// CHECK: shufflevector <4 x float> %{{.*}}, <4 x float> %{{.*}}, <4 x i32> <i32 4, i32 1, i32 6, i32 3>
		return _mm_blend_ps(V1, V2, 5);
		}

		__m128i test_blend_epi16(__m128i V1, __m128i V2) {
		// CHECK-LABEL: @test_blend_epi16
		// CHECK: shufflevector <8 x i16> %{{.*}}, <8 x i16> %{{.*}}, <8 x i32> <i32 0, i32 9, i32 2, i32 11, i32 4, i32 13, i32 6, i32 7>
		return _mm_blend_epi16(V1, V2, 42);
		}