This is an archive of the discontinued LLVM Phabricator instance.

[x86] Fix wrong prototypes for AVX mask load/store intrinsics.
ClosedPublic

Authored by andreadb on Oct 15 2015, 9:02 AM.

Details

Summary

The LLVM types used for the mask operands of the AVX maskload/maskstore intrinsics are incorrect.

In particular:

  1. The mask argument for __builtin_ia32_maskloadpd and __builtin_ia32_maskstorepd should be of type llvm_v2i64_ty and not llvm_v2f64_ty.
  2. The mask argument for __builtin_ia32_maskloadpd256 and __builtin_ia32_maskstorepd256 should be of type llvm_v4i64_ty and not llvm_v4f64_ty.
  3. The mask argument for __builtin_ia32_maskloadps and __builtin_ia32_maskstoreps should be of type llvm_v4i32_ty and not llvm_v4f32_ty.
  4. The mask argument for __builtin_ia32_maskloadps256 and __builtin_ia32_maskstoreps256 should be of type llvm_v8i32_ty and not llvm_v8f32_ty.

In short, the mask type for the maskload/maskstore GCC builtins is never a vector of packed floats/doubles; it is always a vector of packed integers.
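For reference, here is a minimal sketch in LLVM IR of the corrected signatures with integer mask operands (the declarations below assume the usual llvm.x86.avx.* naming convention used elsewhere in this patch):

declare <2 x double> @llvm.x86.avx.maskload.pd(i8*, <2 x i64>)
declare <4 x double> @llvm.x86.avx.maskload.pd.256(i8*, <4 x i64>)
declare void @llvm.x86.avx.maskstore.ps(i8*, <4 x i32>, <4 x float>)
declare void @llvm.x86.avx.maskstore.ps.256(i8*, <8 x i32>, <8 x float>)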

I also noticed that the Clang definitions for those builtins in BuiltinsX86.def are incorrect, and that the maskload/maskstore intrinsic definitions in Clang's avxintrin.h header wrongly use packed floats/doubles instead of packed ints/longs for the mask operands.

For example, _mm_maskstore_pd is currently defined in avxintrin.h as:

static __inline void __DEFAULT_FN_ATTRS
_mm_maskstore_pd(double *__p, __m128d __m, __m128d __a)
{
  __builtin_ia32_maskstorepd((__v2df *)__p, (__v2df)__m, (__v2df)__a);
}

According to the Intel documentation, the correct prototype for _mm_maskstore_pd should be:
void _mm_maskstore_pd(double *p, __m128i m, __m128d a).

So, I think the definition should be something like:

static __inline void __DEFAULT_FN_ATTRS
_mm_maskstore_pd(double *__p, __m128i __m, __m128d __a)
{
  __builtin_ia32_maskstorepd((__v2df *)__p, (__v2di)__m, (__v2df)__a);
}

If you agree with this patch, I plan to send a follow-on patch (this time a Clang patch) to also fix the avxintrin.h intrinsic header (and the prototype definitions for the x86 maskload/maskstore builtins in BuiltinsX86.def); see the sketch below.
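As a rough sketch of the header change (mirroring the corrected store definition above; the attribute macro and casts simply follow the existing avxintrin.h conventions, so the exact committed form may differ), the load counterpart would become:

static __inline __m128d __DEFAULT_FN_ATTRS
_mm_maskload_pd(double const *__p, __m128i __m)
{
  /* The mask is now passed as packed integers (__v2di) instead of packed doubles. */
  return (__m128d)__builtin_ia32_maskloadpd((const __v2df *)__p, (__v2di)__m);
}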

Please let me know if it is okay to submit.

-Andrea

Diff Detail

Repository
rL LLVM

Event Timeline

andreadb updated this revision to Diff 37488. Oct 15 2015, 9:02 AM
andreadb retitled this revision to [x86] Fix wrong prototypes for AVX mask load/store intrinsics.
andreadb updated this object.
andreadb added reviewers: qcolombet, rnk, nadav, bruno, mkuper.
andreadb updated this object.
andreadb added a subscriber: llvm-commits.
bruno edited edge metadata. Oct 15 2015, 10:07 AM

Hi Andrea,

I don't recall off hand but my guess here is that the mask is a vector float type because VMASKMOV* uses two FP ports. In practice I'm not sure what the side effects of changing this are, but it might be something worth considering.

Hi Bruno,

You are right: the VMASKMOV* instructions are definitely in the floating-point domain. Depending on the subtarget, and on whether the VMASKMOV is a load or a store, it may use one or more ports.

That said, in practice the only side effect of changing those intrinsics is that we end up with an extra bitcast when the input mask is a vector of packed floats. However, that bitcast is a bitconvert between types of the same register class, so it is a no-op and no extra instructions are generated (tested on small examples using both constant and non-constant mask values).
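To illustrate (a minimal sketch using the post-patch signature of @llvm.x86.avx.maskstore.ps; the function name is made up for this example):

; The mask arrives as <4 x float> and is bitcast to <4 x i32> before the call.
; The bitcast stays within the XMM register class, so no instruction is
; emitted for it.
define void @store_with_fp_mask(i8* %p, <4 x float> %mask, <4 x float> %v) {
  %m = bitcast <4 x float> %mask to <4 x i32>
  tail call void @llvm.x86.avx.maskstore.ps(i8* %p, <4 x i32> %m, <4 x float> %v)
  ret void
}

declare void @llvm.x86.avx.maskstore.ps(i8*, <4 x i32>, <4 x float>)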

Since VMASKMOV is in the floating-point domain, AMD chips could suffer a stall if the mask value originated in the integer domain: a domain crossing caused by data movement (for example, VInt -> VFP) incurs a 1-cycle penalty on AMD chips.
However, the domain-crossing issue is not something introduced by this change; it was already present with the previous intrinsic definitions (i.e. the source of this "problem" has nothing to do with this change).
The backend already knows how to mitigate this problem by running the "execution dependency fix" pass.

Example:

define void @foo(<4 x float>* %dst, <4 x float> %InVec, <4 x i32> %Mask1, <4 x i32> %Mask2) {
  %xor = xor <4 x i32> %Mask1, %Mask2
  %0 = bitcast <4 x float>* %dst to i8*
  tail call void @llvm.x86.avx.maskstore.ps(i8* %0, <4 x i32> %xor, <4 x float> %InVec)
  ret void
}

declare void @llvm.x86.avx.maskstore.ps(i8*, <4 x i32>, <4 x float>)

In this example, ISel would select a VPXORrr for the 'xor' operation.
Before code emission, we end up with the sequence:

%XMM1<def> = VPXORrr %XMM1<kill>, %XMM2<kill>
VMASKMOVPSmr %RDI<kill>, 1, %noreg, 0, %noreg, %XMM1<kill>, %XMM0<kill>

After 'exedep-fix' we have:

%XMM1<def> = VXORPSrr %XMM1<kill>, %XMM2<kill>
VMASKMOVPSmr %RDI<kill>, 1, %noreg, 0, %noreg, %XMM1<kill>, %XMM0<kill>

As a developer, it feels more natural to think of the mask as a vector of integers rather than floats. For example, we may want to manipulate the mask with logical operators before passing it to the (V)MASKMOV* instructions. Having to pass it as a float simply forces us to insert an explicit (no-op) cast.

That said, I don't know how much we care about being consistent with GCC, which defines those builtins differently from us (i.e. the masks are always vectors of int/long values). This "problem" was found internally when testing the codegen of intrinsic calls; we spotted the discrepancy, and that is the main reason why I uploaded this patch.
At the very least, I suggest we fix our intrinsic definitions in avxintrin.h to be consistent with what the Intel documentation says.

What do you think?

mkuper edited edge metadata.

I don't really have a strong opinion on this.

I definitely agree that we should fix the definitions in avxintrin.h, though, regardless of the way the internals work.

Thanks for the detailed explanation, LGTM

bruno accepted this revision. Oct 19 2015, 6:17 AM
bruno edited edge metadata.

LGTM

This revision is now accepted and ready to land. Oct 19 2015, 6:17 AM
This revision was automatically updated to reflect the committed changes.