This is an archive of the discontinued LLVM Phabricator instance.

Differential D96231

[X86] Always assign reassoc flag for intrinsics *reduce_add/mul_ps/pd.
Closed · Public

Authored by pengfei on Feb 7 2021, 6:58 PM.

Details

Reviewers

RKSimon
craig.topper
spatel

Commits

rGdd2460ed5d77: [X86] Always assign reassoc flag for intrinsics *reduce_add/mul_ps/pd.

Summary

The *reduce_add/mul_ps/pd intrinsics assume that the elements of the
vector are reassociable, so we always need to set the reassoc flag
when we call the _mm_reduce_* intrinsics.
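
As an illustration of why the flag matters (a sketch, not code from this patch): floating-point addition is not associative, so a strictly ordered reduction and a tree-shaped one can disagree on the same input. The helper names `sum_sequential` and `sum_pairwise8` are hypothetical, and the pairwise version only models a halving scheme in plain C; it does not call the intrinsics.

```c
#include <assert.h>
#include <stddef.h>

/* Strict left-to-right sum, starting from the neutral element -0.0.
 * This models a reduction that is NOT allowed to reassociate. */
double sum_sequential(const double *v, size_t n) {
    assert(v != NULL || n == 0);
    double acc = -0.0;
    for (size_t i = 0; i < n; ++i)
        acc += v[i];
    return acc;
}

/* Pairwise sum over 8 lanes: fold the upper half onto the lower half,
 * halving the active width each step (log2(8) = 3 steps). */
double sum_pairwise8(const double v[8]) {
    assert(v != NULL);
    double t[8];
    for (size_t i = 0; i < 8; ++i)
        t[i] = v[i];
    for (size_t width = 4; width >= 1; width /= 2)
        for (size_t i = 0; i < width; ++i)
            t[i] += t[i + width];
    return t[0];
}
```

Assuming IEEE-754 doubles with round-to-nearest and no fast-math compiler flags, the input `{1e16, 1.0, -1e16, 1.0, 0, 0, 0, 0}` gives 1.0 from `sum_sequential` and 2.0 from `sum_pairwise8`, because the first `1.0` is absorbed when it is added to `1e16` in the sequential order.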

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

pengfei requested review of this revision. · Feb 7 2021, 6:58 PM

pengfei created this revision.

Herald added a project: Restricted Project. · Feb 7 2021, 6:58 PM

Herald added a subscriber: cfe-commits.

pengfei mentioned this in D93179: [X86] Convert fmin/fmax _mm_reduce_* intrinsics to emit llvm.reduction intrinsics (PR47506). · Feb 7 2021, 7:01 PM

Harbormaster completed remote builds in B88234: Diff 322015. · Feb 7 2021, 7:32 PM

spatel added inline comments. · Feb 8 2021, 8:28 AM

clang/lib/CodeGen/CGBuiltin.cpp
13829	I haven't looked at this part of the compiler in a long time, so I was wondering how we handle FMF scope. It looks like there is already an FMFGuard object in place -- CodeGenFunction::CGFPOptionsRAII(). So setting FMF here will not affect anything but this CreateCall(). Does that match your understanding? Should we have an extra regression test to make sure that does not change? I am imagining something like:

	double test_mm512_reduce_add_pd(__m512d __W, double ExtraAddOp) {
	  double S = _mm512_reduce_add_pd(__W) + ExtraAddOp;
	  return S;
	}

	Then we could confirm that `reassoc` is not applied to the `fadd` that follows the reduction call.
13829	Currently (and we could say that this is an LLVM codegen bug), we will not generate the optimal/expected reduction with `reassoc` alone. I think the x86 reduction definition is implicitly assuming that -0.0 is not meaningful here, so we should add `nsz` too. The backend is expecting an explicit `nsz` on this op. I.e., I see this x86 asm currently with only `reassoc`:

	vextractf64x4 $1, %zmm0, %ymm1
	vaddpd %zmm1, %zmm0, %zmm0
	vextractf128 $1, %ymm0, %xmm1
	vaddpd %xmm1, %xmm0, %xmm0
	vpermilpd $1, %xmm0, %xmm1
	vaddsd %xmm1, %xmm0, %xmm0
	vxorpd %xmm1, %xmm1, %xmm1 <--- create 0.0
	vaddsd %xmm1, %xmm0, %xmm0 <--- add it to the reduction result

	Alternatively (and I'm not sure where it is specified), we could replace the default 0.0 argument with -0.0?
clang/lib/Headers/avx512fintrin.h
9300	This is an existing text bug, but if we are changing this text, we might as well fix it in this patch - I'm not sure what "off" refers to here. Should that be "order"?
9303	Typo: "floating-point types"
9304	Also mention that sign of zero is indeterminate. We might use the LangRef text as a model for what to say here: https://llvm.org/docs/LangRef.html#llvm-vector-reduce-fadd-intrinsic

spatel added inline comments. · Feb 8 2021, 8:51 AM

clang/lib/Headers/avx512fintrin.h
9355	Ah - this is where the +0.0 is specified. This should be -0.0. We could still add 'nsz' flag to be safe.
9365	This also should be changed to -0.0?
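
To make the start-value point concrete, here is a small scalar sketch in plain C (not the intrinsic itself; `reduce_fadd_from` and `is_negative_zero` are hypothetical helpers). Under IEEE-754 round-to-nearest, (+0.0) + (-0.0) is +0.0, so a +0.0 start value changes the sign of the result when every element is -0.0, whereas -0.0 + x returns x unchanged.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Ordered scalar reduction that begins from an explicit start value,
 * mirroring the (start, vector) operand order of the fadd reduction. */
double reduce_fadd_from(double start, const double *v, size_t n) {
    assert(v != NULL || n == 0);
    double acc = start;
    for (size_t i = 0; i < n; ++i)
        acc += v[i];
    return acc;
}

/* Returns 1 only for negative zero; comparing with == cannot tell
 * +0.0 and -0.0 apart, so the sign bit is inspected as well. */
int is_negative_zero(double x) {
    return x == 0.0 && signbit(x) != 0;
}
```

For an input consisting only of -0.0 elements, `reduce_fadd_from(0.0, ...)` yields +0.0 while `reduce_fadd_from(-0.0, ...)` preserves -0.0, assuming the code is built without fast-math flags.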

Address Sanjay's comments. Thanks for the thorough review!

clang/lib/CodeGen/CGBuiltin.cpp
13829	Confirmed by new tests.
13829	I think there's no such assumption for fadd/fmul instructions. We do have it for fmin/fmax. So I think we don't need to add nsz here.
clang/lib/Headers/avx512fintrin.h
9304	Got it. Thanks!
9355	-0.0 can fix the problem. But we don't need to add 'nsz'. We can add it if we can find a corner case.

LGTM

This revision is now accepted and ready to land. · Feb 9 2021, 4:52 AM

Harbormaster completed remote builds in B88443: Diff 322344. · Feb 9 2021, 5:12 AM

This revision was landed with ongoing or failed builds. · Feb 9 2021, 5:14 AM

Closed by commit rGdd2460ed5d77: [X86] Always assign reassoc flag for intrinsics *reduce_add/mul_ps/pd. (authored by Wang, Pengfei <pengfei.wang@intel.com>).

This revision was automatically updated to reflect the committed changes.

pengfei added a commit: rGdd2460ed5d77: [X86] Always assign reassoc flag for intrinsics *reduce_add/mul_ps/pd.

qiucf mentioned this in D101209: [PowerPC] Provide fastmath sqrt and div functions in altivec.h. · Apr 25 2021, 9:31 PM

Revision Contents

Path                                            Size
clang/lib/CodeGen/CGBuiltin.cpp                 2 lines
clang/lib/Headers/avx512fintrin.h               16 lines
clang/test/CodeGen/X86/avx512-reduceIntrin.c    68 lines

Diff 322350

clang/lib/CodeGen/CGBuiltin.cpp


[13,820 preceding lines not shown]
   case X86::BI__builtin_ia32_reduce_and_q512: {
     Function *F =
         CGM.getIntrinsic(Intrinsic::vector_reduce_and, Ops[0]->getType());
     return Builder.CreateCall(F, {Ops[0]});
   }
   case X86::BI__builtin_ia32_reduce_fadd_pd512:
   case X86::BI__builtin_ia32_reduce_fadd_ps512: {
     Function *F =
         CGM.getIntrinsic(Intrinsic::vector_reduce_fadd, Ops[1]->getType());
+    Builder.getFastMathFlags().setAllowReassoc(true);
     return Builder.CreateCall(F, {Ops[0], Ops[1]});
   }
   case X86::BI__builtin_ia32_reduce_fmul_pd512:
   case X86::BI__builtin_ia32_reduce_fmul_ps512: {
     Function *F =
         CGM.getIntrinsic(Intrinsic::vector_reduce_fmul, Ops[1]->getType());
+    Builder.getFastMathFlags().setAllowReassoc(true);
     return Builder.CreateCall(F, {Ops[0], Ops[1]});
   }
   case X86::BI__builtin_ia32_reduce_mul_d512:
   case X86::BI__builtin_ia32_reduce_mul_q512: {
     Function *F =
         CGM.getIntrinsic(Intrinsic::vector_reduce_mul, Ops[0]->getType());
     return Builder.CreateCall(F, {Ops[0]});
   }
[3,802 following lines not shown]

clang/lib/Headers/avx512fintrin.h


[9,291 preceding lines not shown]
 static __inline__ __m512d __DEFAULT_FN_ATTRS512
 _mm512_mask_abs_pd(__m512d __W, __mmask8 __K, __m512d __A)
 {
   return (__m512d)_mm512_mask_and_epi64((__v8di)__W, __K, _mm512_set1_epi64(0x7FFFFFFFFFFFFFFF),(__v8di)__A);
 }

 /* Vector-reduction arithmetic accepts vectors as inputs and produces scalars as
  * outputs. This class of vector operation forms the basis of many scientific
- * computations. In vector-reduction arithmetic, the evaluation off is
+ * computations. In vector-reduction arithmetic, the evaluation order is
  * independent of the order of the input elements of V.

+ * For floating point types, we always assume the elements are reassociable even
+ * if -fast-math is off.
+
  * Used bisection method. At each step, we partition the vector with previous
  * step in half, and the operation is performed on its two halves.
  * This takes log2(n) steps where n is the number of elements in the vector.
  */

 static __inline__ long long __DEFAULT_FN_ATTRS512 _mm512_reduce_add_epi64(__m512i __W) {
   return __builtin_ia32_reduce_add_q512(__W);
 }
[29 unchanged lines not shown]
 }

 static __inline__ long long __DEFAULT_FN_ATTRS512
 _mm512_mask_reduce_or_epi64(__mmask8 __M, __m512i __W) {
   __W = _mm512_maskz_mov_epi64(__M, __W);
   return __builtin_ia32_reduce_or_q512(__W);
 }

+// -0.0 is used to ignore the start value since it is the neutral value of
+// floating point addition. For more information, please refer to
+// https://llvm.org/docs/LangRef.html#llvm-vector-reduce-fadd-intrinsic
 static __inline__ double __DEFAULT_FN_ATTRS512 _mm512_reduce_add_pd(__m512d __W) {
-  return __builtin_ia32_reduce_fadd_pd512(0.0, __W);
+  return __builtin_ia32_reduce_fadd_pd512(-0.0, __W);
 }

 static __inline__ double __DEFAULT_FN_ATTRS512 _mm512_reduce_mul_pd(__m512d __W) {
   return __builtin_ia32_reduce_fmul_pd512(1.0, __W);
 }

 static __inline__ double __DEFAULT_FN_ATTRS512
 _mm512_mask_reduce_add_pd(__mmask8 __M, __m512d __W) {
   __W = _mm512_maskz_mov_pd(__M, __W);
-  return __builtin_ia32_reduce_fadd_pd512(0.0, __W);
+  return __builtin_ia32_reduce_fadd_pd512(-0.0, __W);
 }

 static __inline__ double __DEFAULT_FN_ATTRS512
 _mm512_mask_reduce_mul_pd(__mmask8 __M, __m512d __W) {
   __W = _mm512_mask_mov_pd(_mm512_set1_pd(1.0), __M, __W);
   return __builtin_ia32_reduce_fmul_pd512(1.0, __W);
 }

[38 unchanged lines not shown]
 static __inline__ int __DEFAULT_FN_ATTRS512
 _mm512_mask_reduce_or_epi32(__mmask16 __M, __m512i __W) {
   __W = _mm512_maskz_mov_epi32(__M, __W);
   return __builtin_ia32_reduce_or_d512((__v16si)__W);
 }

 static __inline__ float __DEFAULT_FN_ATTRS512
 _mm512_reduce_add_ps(__m512 __W) {
-  return __builtin_ia32_reduce_fadd_ps512(0.0f, __W);
+  return __builtin_ia32_reduce_fadd_ps512(-0.0f, __W);
 }

 static __inline__ float __DEFAULT_FN_ATTRS512
 _mm512_reduce_mul_ps(__m512 __W) {
   return __builtin_ia32_reduce_fmul_ps512(1.0f, __W);
 }

 static __inline__ float __DEFAULT_FN_ATTRS512
 _mm512_mask_reduce_add_ps(__mmask16 __M, __m512 __W) {
   __W = _mm512_maskz_mov_ps(__M, __W);
-  return __builtin_ia32_reduce_fadd_ps512(0.0f, __W);
+  return __builtin_ia32_reduce_fadd_ps512(-0.0f, __W);
 }

 static __inline__ float __DEFAULT_FN_ATTRS512
 _mm512_mask_reduce_mul_ps(__mmask16 __M, __m512 __W) {
   __W = _mm512_mask_mov_ps(_mm512_set1_ps(1.0f), __M, __W);
   return __builtin_ia32_reduce_fmul_ps512(1.0f, __W);
 }

[179 following lines not shown]
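
The masked variants in avx512fintrin.h first overwrite the inactive lanes with a value that should not affect the result (zero for addition via `_mm512_maskz_mov_pd`, 1.0 for multiplication via `_mm512_mask_mov_pd`) and then run the unmasked reduction. A scalar model of that idea, as a sketch only: the names `mask_reduce_add8` and `mask_reduce_mul8` are hypothetical, the lanes are visited in order rather than reassociated, and no intrinsics are used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of a masked add reduction over 8 lanes:
 * lanes whose mask bit is clear contribute 0.0. */
double mask_reduce_add8(uint8_t mask, const double v[8]) {
    assert(v != NULL);
    double acc = -0.0; /* start value used by the patched header */
    for (size_t i = 0; i < 8; ++i) {
        double lane = ((mask >> i) & 1u) ? v[i] : 0.0;
        acc += lane;
    }
    return acc;
}

/* Scalar model of a masked multiply reduction over 8 lanes:
 * lanes whose mask bit is clear contribute 1.0. */
double mask_reduce_mul8(uint8_t mask, const double v[8]) {
    assert(v != NULL);
    double acc = 1.0;
    for (size_t i = 0; i < 8; ++i) {
        double lane = ((mask >> i) & 1u) ? v[i] : 1.0;
        acc *= lane;
    }
    return acc;
}
```

With lanes `{1, 2, 3, 4, 5, 6, 7, 8}` and mask `0x05` (lanes 0 and 2 active), the add model returns 4.0 and the multiply model returns 3.0.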

clang/test/CodeGen/X86/avx512-reduceIntrin.c

 // RUN: %clang_cc1 -ffreestanding %s -O0 -triple=x86_64-apple-darwin -target-cpu skylake-avx512 -emit-llvm -o - -Wall -Werror | FileCheck %s

 #include <immintrin.h>

 long long test_mm512_reduce_add_epi64(__m512i __W){
 // CHECK-LABEL: @test_mm512_reduce_add_epi64(
 // CHECK: call i64 @llvm.vector.reduce.add.v8i64(<8 x i64> %{{.*}})
   return _mm512_reduce_add_epi64(__W);
 }

 long long test_mm512_reduce_mul_epi64(__m512i __W){
 // CHECK-LABEL: @test_mm512_reduce_mul_epi64(
 // CHECK: call i64 @llvm.vector.reduce.mul.v8i64(<8 x i64> %{{.*}})
   return _mm512_reduce_mul_epi64(__W);
 }

 long long test_mm512_reduce_or_epi64(__m512i __W){
 // CHECK-LABEL: @test_mm512_reduce_or_epi64(
 // CHECK: call i64 @llvm.vector.reduce.or.v8i64(<8 x i64> %{{.*}})
   return _mm512_reduce_or_epi64(__W);
 }

 long long test_mm512_reduce_and_epi64(__m512i __W){
 // CHECK-LABEL: @test_mm512_reduce_and_epi64(
 // CHECK: call i64 @llvm.vector.reduce.and.v8i64(<8 x i64> %{{.*}})
   return _mm512_reduce_and_epi64(__W);
 }

 long long test_mm512_mask_reduce_add_epi64(__mmask8 __M, __m512i __W){
 // CHECK-LABEL: @test_mm512_mask_reduce_add_epi64(
 // CHECK: bitcast i8 %{{.*}} to <8 x i1>
 // CHECK: select <8 x i1> %{{.*}}, <8 x i64> %{{.*}}, <8 x i64> %{{.*}}
 // CHECK: call i64 @llvm.vector.reduce.add.v8i64(<8 x i64> %{{.*}})
   return _mm512_mask_reduce_add_epi64(__M, __W);
 }

 long long test_mm512_mask_reduce_mul_epi64(__mmask8 __M, __m512i __W){
 // CHECK-LABEL: @test_mm512_mask_reduce_mul_epi64(
 // CHECK: bitcast i8 %{{.*}} to <8 x i1>
 // CHECK: select <8 x i1> %{{.*}}, <8 x i64> %{{.*}}, <8 x i64> %{{.*}}
 // CHECK: call i64 @llvm.vector.reduce.mul.v8i64(<8 x i64> %{{.*}})
   return _mm512_mask_reduce_mul_epi64(__M, __W);
 }

 long long test_mm512_mask_reduce_and_epi64(__mmask8 __M, __m512i __W){
 // CHECK-LABEL: @test_mm512_mask_reduce_and_epi64(
 // CHECK: bitcast i8 %{{.*}} to <8 x i1>
 // CHECK: select <8 x i1> %{{.*}}, <8 x i64> %{{.*}}, <8 x i64> %{{.*}}
 // CHECK: call i64 @llvm.vector.reduce.and.v8i64(<8 x i64> %{{.*}})
   return _mm512_mask_reduce_and_epi64(__M, __W);
 }

 long long test_mm512_mask_reduce_or_epi64(__mmask8 __M, __m512i __W){
 // CHECK-LABEL: @test_mm512_mask_reduce_or_epi64(
 // CHECK: bitcast i8 %{{.*}} to <8 x i1>
 // CHECK: select <8 x i1> %{{.*}}, <8 x i64> %{{.*}}, <8 x i64> %{{.*}}
 // CHECK: call i64 @llvm.vector.reduce.or.v8i64(<8 x i64> %{{.*}})
   return _mm512_mask_reduce_or_epi64(__M, __W);
 }

 int test_mm512_reduce_add_epi32(__m512i __W){
 // CHECK-LABEL: @test_mm512_reduce_add_epi32(
 // CHECK: call i32 @llvm.vector.reduce.add.v16i32(<16 x i32> %{{.*}})
   return _mm512_reduce_add_epi32(__W);
 }

 int test_mm512_reduce_mul_epi32(__m512i __W){
 // CHECK-LABEL: @test_mm512_reduce_mul_epi32(
 // CHECK: call i32 @llvm.vector.reduce.mul.v16i32(<16 x i32> %{{.*}})
   return _mm512_reduce_mul_epi32(__W);
 }

 int test_mm512_reduce_or_epi32(__m512i __W){
 // CHECK: call i32 @llvm.vector.reduce.or.v16i32(<16 x i32> %{{.*}})
   return _mm512_reduce_or_epi32(__W);
 }

 int test_mm512_reduce_and_epi32(__m512i __W){
 // CHECK-LABEL: @test_mm512_reduce_and_epi32(
 // CHECK: call i32 @llvm.vector.reduce.and.v16i32(<16 x i32> %{{.*}})
   return _mm512_reduce_and_epi32(__W);
 }

 int test_mm512_mask_reduce_add_epi32(__mmask16 __M, __m512i __W){
 // CHECK-LABEL: @test_mm512_mask_reduce_add_epi32(
 // CHECK: bitcast i16 %{{.*}} to <16 x i1>
 // CHECK: select <16 x i1> %{{.*}}, <16 x i32> %{{.*}}, <16 x i32> %{{.*}}
 // CHECK: call i32 @llvm.vector.reduce.add.v16i32(<16 x i32> %{{.*}})
   return _mm512_mask_reduce_add_epi32(__M, __W);
 }

 int test_mm512_mask_reduce_mul_epi32(__mmask16 __M, __m512i __W){
 // CHECK-LABEL: @test_mm512_mask_reduce_mul_epi32(
 // CHECK: bitcast i16 %{{.*}} to <16 x i1>
 // CHECK: select <16 x i1> %{{.*}}, <16 x i32> %{{.*}}, <16 x i32> %{{.*}}
 // CHECK: call i32 @llvm.vector.reduce.mul.v16i32(<16 x i32> %{{.*}})
   return _mm512_mask_reduce_mul_epi32(__M, __W);
 }

 int test_mm512_mask_reduce_and_epi32(__mmask16 __M, __m512i __W){
 // CHECK-LABEL: @test_mm512_mask_reduce_and_epi32(
 // CHECK: bitcast i16 %{{.*}} to <16 x i1>
 // CHECK: select <16 x i1> %{{.*}}, <16 x i32> %{{.*}}, <16 x i32> %{{.*}}
 // CHECK: call i32 @llvm.vector.reduce.and.v16i32(<16 x i32> %{{.*}})
   return _mm512_mask_reduce_and_epi32(__M, __W);
 }

 int test_mm512_mask_reduce_or_epi32(__mmask16 __M, __m512i __W){
 // CHECK-LABEL: @test_mm512_mask_reduce_or_epi32(
 // CHECK: bitcast i16 %{{.*}} to <16 x i1>
 // CHECK: select <16 x i1> %{{.*}}, <16 x i32> %{{.*}}, <16 x i32> %{{.*}}
 // CHECK: call i32 @llvm.vector.reduce.or.v16i32(<16 x i32> %{{.*}})
   return _mm512_mask_reduce_or_epi32(__M, __W);
 }

-double test_mm512_reduce_add_pd(__m512d __W){
+double test_mm512_reduce_add_pd(__m512d __W, double ExtraAddOp){
 // CHECK-LABEL: @test_mm512_reduce_add_pd(
-// CHECK: call double @llvm.vector.reduce.fadd.v8f64(double 0.000000e+00, <8 x double> %{{.*}})
-  return _mm512_reduce_add_pd(__W);
+// CHECK-NOT: reassoc
+// CHECK: call reassoc double @llvm.vector.reduce.fadd.v8f64(double -0.000000e+00, <8 x double> %{{.*}})
+// CHECK-NOT: reassoc
+  return _mm512_reduce_add_pd(__W) + ExtraAddOp;
 }

-double test_mm512_reduce_mul_pd(__m512d __W){
+double test_mm512_reduce_mul_pd(__m512d __W, double ExtraMulOp){
 // CHECK-LABEL: @test_mm512_reduce_mul_pd(
-// CHECK: call double @llvm.vector.reduce.fmul.v8f64(double 1.000000e+00, <8 x double> %{{.*}})
-  return _mm512_reduce_mul_pd(__W);
+// CHECK-NOT: reassoc
+// CHECK: call reassoc double @llvm.vector.reduce.fmul.v8f64(double 1.000000e+00, <8 x double> %{{.*}})
+// CHECK-NOT: reassoc
+  return _mm512_reduce_mul_pd(__W) * ExtraMulOp;
 }

 float test_mm512_reduce_add_ps(__m512 __W){
 // CHECK-LABEL: @test_mm512_reduce_add_ps(
-// CHECK: call float @llvm.vector.reduce.fadd.v16f32(float 0.000000e+00, <16 x float> %{{.*}})
+// CHECK: call reassoc float @llvm.vector.reduce.fadd.v16f32(float -0.000000e+00, <16 x float> %{{.*}})
   return _mm512_reduce_add_ps(__W);
 }

 float test_mm512_reduce_mul_ps(__m512 __W){
 // CHECK-LABEL: @test_mm512_reduce_mul_ps(
-// CHECK: call float @llvm.vector.reduce.fmul.v16f32(float 1.000000e+00, <16 x float> %{{.*}})
+// CHECK: call reassoc float @llvm.vector.reduce.fmul.v16f32(float 1.000000e+00, <16 x float> %{{.*}})
   return _mm512_reduce_mul_ps(__W);
 }

 double test_mm512_mask_reduce_add_pd(__mmask8 __M, __m512d __W){
 // CHECK-LABEL: @test_mm512_mask_reduce_add_pd(
 // CHECK: bitcast i8 %{{.*}} to <8 x i1>
 // CHECK: select <8 x i1> %{{.*}}, <8 x double> %{{.*}}, <8 x double> %{{.*}}
-// CHECK: call double @llvm.vector.reduce.fadd.v8f64(double 0.000000e+00, <8 x double> %{{.*}})
+// CHECK: call reassoc double @llvm.vector.reduce.fadd.v8f64(double -0.000000e+00, <8 x double> %{{.*}})
   return _mm512_mask_reduce_add_pd(__M, __W);
 }

 double test_mm512_mask_reduce_mul_pd(__mmask8 __M, __m512d __W){
 // CHECK-LABEL: @test_mm512_mask_reduce_mul_pd(
 // CHECK: bitcast i8 %{{.*}} to <8 x i1>
 // CHECK: select <8 x i1> %{{.*}}, <8 x double> %{{.*}}, <8 x double> %{{.*}}
-// CHECK: call double @llvm.vector.reduce.fmul.v8f64(double 1.000000e+00, <8 x double> %{{.*}})
+// CHECK: call reassoc double @llvm.vector.reduce.fmul.v8f64(double 1.000000e+00, <8 x double> %{{.*}})
   return _mm512_mask_reduce_mul_pd(__M, __W);
 }

 float test_mm512_mask_reduce_add_ps(__mmask16 __M, __m512 __W){
 // CHECK-LABEL: @test_mm512_mask_reduce_add_ps(
 // CHECK: bitcast i16 %{{.*}} to <16 x i1>
 // CHECK: select <16 x i1> %{{.*}}, <16 x float> {{.*}}, <16 x float> {{.*}}
-// CHECK: call float @llvm.vector.reduce.fadd.v16f32(float 0.000000e+00, <16 x float> %{{.*}})
+// CHECK: call reassoc float @llvm.vector.reduce.fadd.v16f32(float -0.000000e+00, <16 x float> %{{.*}})
   return _mm512_mask_reduce_add_ps(__M, __W);
 }

 float test_mm512_mask_reduce_mul_ps(__mmask16 __M, __m512 __W){
 // CHECK-LABEL: @test_mm512_mask_reduce_mul_ps(
 // CHECK: bitcast i16 %{{.*}} to <16 x i1>
 // CHECK: select <16 x i1> %{{.*}}, <16 x float> {{.*}}, <16 x float> %{{.*}}
-// CHECK: call float @llvm.vector.reduce.fmul.v16f32(float 1.000000e+00, <16 x float> %{{.*}})
+// CHECK: call reassoc float @llvm.vector.reduce.fmul.v16f32(float 1.000000e+00, <16 x float> %{{.*}})
   return _mm512_mask_reduce_mul_ps(__M, __W);
 }