This is an archive of the discontinued LLVM Phabricator instance.

[X86][SSE] Prevent misaligned non-temporal vector load/store combines
ClosedPublic

Authored by RKSimon on Jun 13 2019, 3:02 AM.

Details

Summary

For loads, pre-SSE41 we can't perform NT loads at all, and even with SSE41 we can only perform aligned NT vector loads, so if the alignment is less than that of an xmm register (16 bytes) we'll just end up using regular unaligned vector loads anyway.

First step towards fixing PR42026 - the next step for stores will be to use SSE4A movntsd where possible and to avoid the stack spill on SSE2 targets.
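As a rough illustration of the rule described above, here is a minimal standalone sketch of the decision (the function name and parameters are hypothetical and not part of the patch): a misaligned non-temporal vector load is allowed to stay as a single access only when it would be lowered as a regular unaligned vector load anyway, while misaligned non-temporal vector stores are rejected so they can be split.

  // Hypothetical standalone model of the rule in the summary - not the
  // actual X86TargetLowering hook, just its decision logic for NT vectors.
  bool allowMisalignedNTVectorAccess(bool IsLoad, bool HasSSE41,
                                     unsigned AlignBytes) {
    if (IsLoad) {
      // Pre-SSE41 there are no NT vector loads at all, and SSE41's MOVNTDQA
      // requires 16-byte alignment, so an under-aligned NT load ends up as a
      // regular unaligned vector load either way - allow it.
      return AlignBytes < 16 || !HasSSE41;
    }
    // Misaligned NT vector stores are rejected so they can be split into
    // aligned pieces that still use non-temporal store instructions.
    return false;
  }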

Diff Detail

Repository
rL LLVM

Event Timeline

RKSimon created this revision.Jun 13 2019, 3:02 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 13 2019, 3:02 AM
lebedev.ri added inline comments.Jun 14 2019, 4:08 PM
lib/Target/X86/X86ISelLowering.cpp
2133–2136 ↗(On Diff #204461)

So this says that if this is a non-temporal load of a vector,
either from a pointer whose alignment is too small to even do a 128-bit aligned load from it,
or we don't have SSE41 at all (so no aligned non-temporal vector loads whatsoever),
then the under-aligned load is allowed, correct?
This kind of looks backwards to me.
I expected something like:

  if (!!(Flags & MachineMemOperand::MONonTemporal) && VT.isVector()) {
    return !!(Flags & MachineMemOperand::MOLoad) &&
           Align >= VT.getSizeInBytes() && Subtarget.hasSSE41();
  }

Also, judging by the LHS of the diffs, there isn't anything to split those loads/stores currently?

Also, judging by the LHS of the diffs, there isn't anything to split those loads/stores currently?

I'm adding more tests as I can, but I need to get correct allowsMisalignedMemoryAccesses handling in before I can improve combineLoad and combineStore - otherwise the existing load/store merge combines will fight against us (e.g. nt-load splitting fails horribly...).

lib/Target/X86/X86ISelLowering.cpp
2133–2136 ↗(On Diff #204461)

What I have looks correct to me - we need to return true if the misaligned load is acceptable, i.e. if align < 16 or we don't have SSE41 - but I will pull more of the if() logic out into the return.

Also, allowsMisalignedMemoryAccesses won't have been called if Align >= VT.getSizeInBytes().
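For reference, a rough sketch of what the cleaned-up check could look like inside allowsMisalignedMemoryAccesses, based on the discussion above (an illustrative fragment, not a verbatim copy of the committed diff):

  if (!!(Flags & MachineMemOperand::MONonTemporal) && VT.isVector()) {
    if (!!(Flags & MachineMemOperand::MOLoad))
      // Under-aligned NT vector loads fall back to regular unaligned vector
      // loads anyway (and pre-SSE41 there are no NT vector loads at all),
      // so allow them rather than splitting.
      return (Align < 16 || !Subtarget.hasSSE41());
    // Misaligned NT vector stores are not allowed; they get split so that
    // the aligned pieces can keep using non-temporal stores.
    return false;
  }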

RKSimon updated this revision to Diff 205050.Jun 17 2019, 5:48 AM

cleaned up nt-load if() logic - rebased with nontemporal-3.ll tests

lebedev.ri marked an inline comment as done.Jun 17 2019, 5:49 AM
lebedev.ri added inline comments.
lib/Target/X86/X86ISelLowering.cpp
2133–2136 ↗(On Diff #204461)

Thinking about it more, yeah, the tests seem to agree that it works as intended, even though the allowsMisalignedMemoryAccesses() looks backwards.

andreadb accepted this revision.Jun 17 2019, 6:05 AM

Looks good to me.

test/CodeGen/X86/nontemporal-3.ll
388–393 ↗(On Diff #205050)

This SSE sequence is clearly sub-optimal.

That being said, I am not too worried about it given how unlikely this scenario is in practice.

If possible, it would be nice to have it fixed in a follow-up patch.
Basically, there is no reason why we should zero XMM0, store it on the stack, and then reload its elements into GPRs. We should just zero a GPR and have both MOVNTI instructions use it. I suspect this has to do with how we lower certain nodes on SSE.

This revision is now accepted and ready to land.Jun 17 2019, 6:05 AM
This revision was automatically updated to reflect the committed changes.