This is an archive of the discontinued LLVM Phabricator instance.

Differential D63246

[X86][SSE] Prevent misaligned non-temporal vector load/store combines
Closed · Public

Authored by RKSimon on Jun 13 2019, 3:02 AM.

Details

Reviewers

craig.topper
andreadb
wristow
lebedev.ri

Commits

rG454e6b9010fe: [X86][SSE] Prevent misaligned non-temporal vector load/store combines
rL363564: [X86][SSE] Prevent misaligned non-temporal vector load/store combines

Summary

For loads, pre-SSE41 targets can't perform NT loads at all, and from SSE41 onwards NT loads must be vector aligned. So if the alignment is less than that of an xmm register (16 bytes), we'll just end up using regular unaligned vector loads anyway.

First step towards fixing PR42026 - the next step for stores will be to use SSE4A movntsd where possible and to avoid the stack spill on SSE2 targets.

Diff Detail

Repository: rL LLVM

Event Timeline

RKSimon created this revision. Jun 13 2019, 3:02 AM

Herald added a project: Restricted Project. Jun 13 2019, 3:02 AM

RKSimon mentioned this in rL363247: [X86][SSE] Add SSE4A nt store tests on X86 as well as X64. Jun 13 2019, 3:27 AM

RKSimon mentioned this in rGe1aea8589683: [X86][SSE] Add SSE4A nt store tests on X86 as well as X64.

lebedev.ri added inline comments. Jun 14 2019, 4:08 PM

lib/Target/X86/X86ISelLowering.cpp
16–19	So this says that if this is a non-temporal load of a vector, and either the pointer is aligned so little that we can't even make a 128-bit aligned load from it, or we do not even have SSE41 (so no aligned non-temporal vector loads at all), then the underaligned load is allowed, correct? This kind of looks backwards to me. I expected something like `if (!!(Flags & MachineMemOperand::MONonTemporal) && VT.isVector()) { return !!(Flags & MachineMemOperand::MOLoad) && Align >= VT.getSizeInBytes() && Subtarget.hasSSE41(); }`

Also, judging by the LHS of the diffs, there isn't anything to split those loads/stores currently?

In D63246#1544547, @lebedev.ri wrote:

Also, judging by the LHS of the diffs, there isn't anything to split those loads/stores currently?

I'm adding more tests as I can, but I need to get correct allowsMisalignedMemoryAccesses handling in before I can improve combineLoad and combineStore. Otherwise the existing load/store merge combines will fight against us (e.g. nt-load splitting fails horribly).

lib/Target/X86/X86ISelLowering.cpp
16–19	What I have looks correct to me - we need to return true if the misaligned load is acceptable - so if align<16 or we don't have SSE41 - but I will pull out more of the if() logic into the return. Also, allowsMisalignedMemoryAccesses won't have been called if Align >= VT.getSizeInBytes().
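
The rule described in this reply can be modelled outside LLVM as a small predicate. This is only a sketch: `AccessInfo` and `allowsMisaligned` are hypothetical stand-ins for `VT`, `Flags`, `Align` and `Subtarget.hasSSE41()`, and the code mirrors just the non-temporal branch shown in this diff, not the `Fast` handling.

```cpp
#include <cassert>

// Hypothetical stand-in for the arguments of the real LLVM hook.
struct AccessInfo {
  bool isVector;       // models VT.isVector()
  bool isNonTemporal;  // models Flags & MachineMemOperand::MONonTemporal
  bool isLoad;         // models Flags & MachineMemOperand::MOLoad
  unsigned alignBytes; // models Align
};

// Returns true when a misaligned access of this kind is acceptable.
// The hook is only consulted for under-aligned accesses, so "true" means
// "go ahead and use a regular unaligned access".
inline bool allowsMisaligned(const AccessInfo &A, bool hasSSE41) {
  assert(A.alignBytes != 0 && "alignment must be non-zero");
  if (A.isNonTemporal && A.isVector) {
    // NT vector loads need SSE41 and at least xmm (16-byte) alignment;
    // if either is missing, a regular unaligned load is used anyway.
    if (A.isLoad)
      return A.alignBytes < 16 || !hasSSE41;
    // NT vector stores must stay aligned, so never allow misalignment.
    return false;
  }
  // Everything else: misaligned accesses are always allowed.
  return true;
}
```

Under this model, a non-temporal vector load at alignment 1 is allowed to be misaligned, whereas one at alignment 16 on an SSE41 target is not, which is what keeps the two 16-byte aligned xmm NT loads from being merged into a single under-aligned ymm load.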

cleaned up nt-load if() logic - rebased with nontemporal-3.ll tests

lebedev.ri marked an inline comment as done. Jun 17 2019, 5:49 AM

lebedev.ri added inline comments.

lib/Target/X86/X86ISelLowering.cpp
16–19	Thinking about it more, yeah, the tests seem to agree that it works as intended, even though the `allowsMisalignedMemoryAccesses()` looks backwards.

Looks good to me.

test/CodeGen/X86/nontemporal-3.ll
388–393	This SSE sequence is clearly sub-optimal. That being said, I am not too worried about it given how unlucky this scenario is in practice. If possible, it would be nice to have it fixed in a follow-up patch. Basically, there is no reason to zero XMM0, store it on the stack, and then reload its elements into GPRs. We should just zero a GPR and then have both MOVNTI instructions use it. I suspect this has to do with how we lower certain nodes on SSE.
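
The sequence suggested in this comment (one zeroed GPR feeding both MOVNTI stores) can be written by hand with intrinsics. This is a rough sketch, not what the patch generates: `zero16Streaming` is a hypothetical helper, the x86-64 path assumes a GCC- or Clang-style `<emmintrin.h>`, and `dst` is assumed to be aligned for `long long` so that the pointer cast stays well-defined C++, even though the test in this review targets `align 1`.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#if defined(__x86_64__)
#include <emmintrin.h> // _mm_stream_si64, _mm_sfence
#endif

// Hypothetical helper: zero 16 bytes at dst using two scalar
// non-temporal stores, with no round trip through an XMM register
// or the stack.
inline void zero16Streaming(void *dst) {
  // Assumption for this sketch: dst is suitably aligned for long long.
  assert(reinterpret_cast<std::uintptr_t>(dst) % alignof(long long) == 0);
#if defined(__x86_64__)
  long long *p = static_cast<long long *>(dst);
  _mm_stream_si64(p, 0);     // scalar NT store of the low 8 bytes
  _mm_stream_si64(p + 1, 0); // scalar NT store of the high 8 bytes
  _mm_sfence();              // order the streaming stores before later stores
#else
  // Fallback for other targets: a plain (temporal) clear.
  std::memset(dst, 0, 16);
#endif
}
```

Whether a compiler materializes the zero only once for both stores is up to its code generator; the sketch only shows that nothing in the operation itself requires a vector register.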

This revision is now accepted and ready to land. Jun 17 2019, 6:05 AM

Closed by commit rL363564: [X86][SSE] Prevent misaligned non-temporal vector load/store combines (authored by RKSimon). Jun 17 2019, 7:24 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
lib/Target/X86/X86ISelLowering.cpp (revision 363534)	17 lines
test/CodeGen/X86/merge-consecutive-stores-nt.ll (revision 363534)	152 lines
test/CodeGen/X86/nontemporal-3.ll (revision 363552)	660 lines

Diff 205050

lib/Target/X86/X86ISelLowering.cpp

	//===-- X86ISelLowering.cpp - X86 DAG Lowering Implementation -------------===//			//===-- X86ISelLowering.cpp - X86 DAG Lowering Implementation -------------===//
	//			//
	// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.			// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
	// See https://llvm.org/LICENSE.txt for license information.			// See https://llvm.org/LICENSE.txt for license information.
	// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception			// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	//			//
	// This file defines the interfaces that X86 uses to lower LLVM code into a			// This file defines the interfaces that X86 uses to lower LLVM code into a
	// selection DAG.			// selection DAG.
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	#include "X86ISelLowering.h"			#include "X86ISelLowering.h"
	#include "Utils/X86ShuffleDecode.h"			#include "Utils/X86ShuffleDecode.h"
	#include "X86CallingConv.h"			#include "X86CallingConv.h"
	#include "X86FrameLowering.h"			#include "X86FrameLowering.h"
	#include "X86InstrBuilder.h"			#include "X86InstrBuilder.h"
	#include "X86IntrinsicsInfo.h"			#include "X86IntrinsicsInfo.h"
	#include "X86MachineFunctionInfo.h"			#include "X86MachineFunctionInfo.h"
	#include "X86TargetMachine.h"			#include "X86TargetMachine.h"
	#include "X86TargetObjectFile.h"			#include "X86TargetObjectFile.h"
	#include "llvm/ADT/SmallBitVector.h"			#include "llvm/ADT/SmallBitVector.h"
	#include "llvm/ADT/SmallSet.h"			#include "llvm/ADT/SmallSet.h"
	#include "llvm/ADT/Statistic.h"			#include "llvm/ADT/Statistic.h"
	#include "llvm/ADT/StringExtras.h"			#include "llvm/ADT/StringExtras.h"
	#include "llvm/ADT/StringSwitch.h"			#include "llvm/ADT/StringSwitch.h"
	[... 2,073 lines not shown ...]
	bool X86TargetLowering::isSafeMemOpType(MVT VT) const {			bool X86TargetLowering::isSafeMemOpType(MVT VT) const {
	if (VT == MVT::f32)			if (VT == MVT::f32)
	return X86ScalarSSEf32;			return X86ScalarSSEf32;
	else if (VT == MVT::f64)			else if (VT == MVT::f64)
	return X86ScalarSSEf64;			return X86ScalarSSEf64;
	return true;			return true;
	}			}

	bool X86TargetLowering::allowsMisalignedMemoryAccesses(EVT VT, unsigned,			bool X86TargetLowering::allowsMisalignedMemoryAccesses(
	unsigned,			EVT VT, unsigned, unsigned Align, MachineMemOperand::Flags Flags,
	MachineMemOperand::Flags,
	bool *Fast) const {			bool *Fast) const {
	if (Fast) {			if (Fast) {
	switch (VT.getSizeInBits()) {			switch (VT.getSizeInBits()) {
	default:			default:
	// 8-byte and under are always assumed to be fast.			// 8-byte and under are always assumed to be fast.
	*Fast = true;			*Fast = true;
	break;			break;
	case 128:			case 128:
	*Fast = !Subtarget.isUnalignedMem16Slow();			*Fast = !Subtarget.isUnalignedMem16Slow();
	break;			break;
	case 256:			case 256:
	*Fast = !Subtarget.isUnalignedMem32Slow();			*Fast = !Subtarget.isUnalignedMem32Slow();
	break;			break;
	// TODO: What about AVX-512 (512-bit) accesses?			// TODO: What about AVX-512 (512-bit) accesses?
	}			}
	}			}
				// NonTemporal vector memory ops must be aligned.
				if (!!(Flags & MachineMemOperand::MONonTemporal) && VT.isVector()) {
				// NT loads can only be vector aligned, so if its less aligned than the
				// minimum vector size (which we can split the vector down to), we might as
				// well use a regular unaligned vector load.
				// We don't have any NT loads pre-SSE41.
				if (!!(Flags & MachineMemOperand::MOLoad))
				return (Align < 16 || !Subtarget.hasSSE41());
				return false;
				}
	// Misaligned accesses of any size are always allowed.			// Misaligned accesses of any size are always allowed.
	return true;			return true;
	}			}

	/// Return the entry encoding for a jump table in the			/// Return the entry encoding for a jump table in the
	/// current function. The returned value is a member of the			/// current function. The returned value is a member of the
	/// MachineJumpTableInfo::JTEntryKind enum.			/// MachineJumpTableInfo::JTEntryKind enum.
	unsigned X86TargetLowering::getJumpTableEncoding() const {			unsigned X86TargetLowering::getJumpTableEncoding() const {
	[... 42,792 lines not shown ...]

test/CodeGen/X86/merge-consecutive-stores-nt.ll

[... 157 lines not shown ...]
; X64-AVX-NEXT: retq
%4 = load <4 x float>, <4 x float>* %2, align 16		%4 = load <4 x float>, <4 x float>* %2, align 16
%5 = getelementptr inbounds <4 x float>, <4 x float>* %a1, i64 1, i64 0		%5 = getelementptr inbounds <4 x float>, <4 x float>* %a1, i64 1, i64 0
%6 = bitcast float* %5 to <4 x float>*		%6 = bitcast float* %5 to <4 x float>*
store <4 x float> %3, <4 x float>* %a1, align 32, !nontemporal !0		store <4 x float> %3, <4 x float>* %a1, align 32, !nontemporal !0
store <4 x float> %4, <4 x float>* %6, align 16		store <4 x float> %4, <4 x float>* %6, align 16
ret void		ret void
}		}

; FIXME: AVX2 can't perform NT-load-ymm on 16-byte aligned memory.		; AVX2 can't perform NT-load-ymm on 16-byte aligned memory.
; Must be kept seperate as VMOVNTDQA xmm.		; Must be kept seperate as VMOVNTDQA xmm.
define void @merge_2_v4f32_align16_ntload(<4 x float>* %a0, <4 x float>* %a1) nounwind {		define void @merge_2_v4f32_align16_ntload(<4 x float>* %a0, <4 x float>* %a1) nounwind {
; X86-LABEL: merge_2_v4f32_align16_ntload:		; X86-LABEL: merge_2_v4f32_align16_ntload:
; X86: # %bb.0:		; X86: # %bb.0:
; X86-NEXT: movl {{[0-9]+}}(%esp), %eax		; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
; X86-NEXT: movl {{[0-9]+}}(%esp), %ecx		; X86-NEXT: movl {{[0-9]+}}(%esp), %ecx
; X86-NEXT: movaps (%ecx), %xmm0		; X86-NEXT: movaps (%ecx), %xmm0
; X86-NEXT: movaps 16(%ecx), %xmm1		; X86-NEXT: movaps 16(%ecx), %xmm1
[... 20 lines not shown ...]
; X64-SSE41-LABEL: merge_2_v4f32_align16_ntload:		; X64-SSE41-LABEL: merge_2_v4f32_align16_ntload:
; X64-SSE41: # %bb.0:		; X64-SSE41: # %bb.0:
; X64-SSE41-NEXT: movntdqa (%rdi), %xmm0		; X64-SSE41-NEXT: movntdqa (%rdi), %xmm0
; X64-SSE41-NEXT: movntdqa 16(%rdi), %xmm1		; X64-SSE41-NEXT: movntdqa 16(%rdi), %xmm1
; X64-SSE41-NEXT: movdqa %xmm0, (%rsi)		; X64-SSE41-NEXT: movdqa %xmm0, (%rsi)
; X64-SSE41-NEXT: movdqa %xmm1, 16(%rsi)		; X64-SSE41-NEXT: movdqa %xmm1, 16(%rsi)
; X64-SSE41-NEXT: retq		; X64-SSE41-NEXT: retq
;		;
; X64-AVX1-LABEL: merge_2_v4f32_align16_ntload:		; X64-AVX-LABEL: merge_2_v4f32_align16_ntload:
; X64-AVX1: # %bb.0:		; X64-AVX: # %bb.0:
; X64-AVX1-NEXT: vmovntdqa (%rdi), %xmm0		; X64-AVX-NEXT: vmovntdqa (%rdi), %xmm0
; X64-AVX1-NEXT: vmovntdqa 16(%rdi), %xmm1		; X64-AVX-NEXT: vmovntdqa 16(%rdi), %xmm1
; X64-AVX1-NEXT: vmovdqa %xmm1, 16(%rsi)		; X64-AVX-NEXT: vmovdqa %xmm0, (%rsi)
; X64-AVX1-NEXT: vmovdqa %xmm0, (%rsi)		; X64-AVX-NEXT: vmovdqa %xmm1, 16(%rsi)
; X64-AVX1-NEXT: retq		; X64-AVX-NEXT: retq
;
; X64-AVX2-LABEL: merge_2_v4f32_align16_ntload:
; X64-AVX2: # %bb.0:
; X64-AVX2-NEXT: vmovups (%rdi), %ymm0
; X64-AVX2-NEXT: vmovups %ymm0, (%rsi)
; X64-AVX2-NEXT: vzeroupper
; X64-AVX2-NEXT: retq
%1 = getelementptr inbounds <4 x float>, <4 x float>* %a0, i64 1, i64 0		%1 = getelementptr inbounds <4 x float>, <4 x float>* %a0, i64 1, i64 0
%2 = bitcast float* %1 to <4 x float>*		%2 = bitcast float* %1 to <4 x float>*
%3 = load <4 x float>, <4 x float>* %a0, align 16, !nontemporal !0		%3 = load <4 x float>, <4 x float>* %a0, align 16, !nontemporal !0
%4 = load <4 x float>, <4 x float>* %2, align 16, !nontemporal !0		%4 = load <4 x float>, <4 x float>* %2, align 16, !nontemporal !0
%5 = getelementptr inbounds <4 x float>, <4 x float>* %a1, i64 1, i64 0		%5 = getelementptr inbounds <4 x float>, <4 x float>* %a1, i64 1, i64 0
%6 = bitcast float* %5 to <4 x float>*		%6 = bitcast float* %5 to <4 x float>*
store <4 x float> %3, <4 x float>* %a1, align 16		store <4 x float> %3, <4 x float>* %a1, align 16
store <4 x float> %4, <4 x float>* %6, align 16		store <4 x float> %4, <4 x float>* %6, align 16
ret void		ret void
}		}

; FIXME: AVX can't perform NT-store-ymm on 16-byte aligned memory.		; AVX can't perform NT-store-ymm on 16-byte aligned memory.
; Must be kept seperate as VMOVNTPS xmm.		; Must be kept seperate as VMOVNTPS xmm.
define void @merge_2_v4f32_align16_ntstore(<4 x float>* %a0, <4 x float>* %a1) nounwind {		define void @merge_2_v4f32_align16_ntstore(<4 x float>* %a0, <4 x float>* %a1) nounwind {
; X86-LABEL: merge_2_v4f32_align16_ntstore:		; X86-LABEL: merge_2_v4f32_align16_ntstore:
; X86: # %bb.0:		; X86: # %bb.0:
; X86-NEXT: movl {{[0-9]+}}(%esp), %eax		; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
; X86-NEXT: movl {{[0-9]+}}(%esp), %ecx		; X86-NEXT: movl {{[0-9]+}}(%esp), %ecx
; X86-NEXT: movaps (%ecx), %xmm0		; X86-NEXT: movaps (%ecx), %xmm0
; X86-NEXT: movaps 16(%ecx), %xmm1		; X86-NEXT: movaps 16(%ecx), %xmm1
; X86-NEXT: movntps %xmm0, (%eax)		; X86-NEXT: movntps %xmm0, (%eax)
; X86-NEXT: movntps %xmm1, 16(%eax)		; X86-NEXT: movntps %xmm1, 16(%eax)
; X86-NEXT: retl		; X86-NEXT: retl
;		;
; X64-SSE-LABEL: merge_2_v4f32_align16_ntstore:		; X64-SSE-LABEL: merge_2_v4f32_align16_ntstore:
; X64-SSE: # %bb.0:		; X64-SSE: # %bb.0:
; X64-SSE-NEXT: movaps (%rdi), %xmm0		; X64-SSE-NEXT: movaps (%rdi), %xmm0
; X64-SSE-NEXT: movaps 16(%rdi), %xmm1		; X64-SSE-NEXT: movaps 16(%rdi), %xmm1
; X64-SSE-NEXT: movntps %xmm0, (%rsi)		; X64-SSE-NEXT: movntps %xmm0, (%rsi)
; X64-SSE-NEXT: movntps %xmm1, 16(%rsi)		; X64-SSE-NEXT: movntps %xmm1, 16(%rsi)
; X64-SSE-NEXT: retq		; X64-SSE-NEXT: retq
;		;
; X64-AVX-LABEL: merge_2_v4f32_align16_ntstore:		; X64-AVX-LABEL: merge_2_v4f32_align16_ntstore:
; X64-AVX: # %bb.0:		; X64-AVX: # %bb.0:
; X64-AVX-NEXT: vmovups (%rdi), %ymm0		; X64-AVX-NEXT: vmovaps (%rdi), %xmm0
; X64-AVX-NEXT: vmovups %ymm0, (%rsi)		; X64-AVX-NEXT: vmovaps 16(%rdi), %xmm1
; X64-AVX-NEXT: vzeroupper		; X64-AVX-NEXT: vmovntps %xmm0, (%rsi)
		; X64-AVX-NEXT: vmovntps %xmm1, 16(%rsi)
; X64-AVX-NEXT: retq		; X64-AVX-NEXT: retq
%1 = getelementptr inbounds <4 x float>, <4 x float>* %a0, i64 1, i64 0		%1 = getelementptr inbounds <4 x float>, <4 x float>* %a0, i64 1, i64 0
%2 = bitcast float* %1 to <4 x float>*		%2 = bitcast float* %1 to <4 x float>*
%3 = load <4 x float>, <4 x float>* %a0, align 16		%3 = load <4 x float>, <4 x float>* %a0, align 16
%4 = load <4 x float>, <4 x float>* %2, align 16		%4 = load <4 x float>, <4 x float>* %2, align 16
%5 = getelementptr inbounds <4 x float>, <4 x float>* %a1, i64 1, i64 0		%5 = getelementptr inbounds <4 x float>, <4 x float>* %a1, i64 1, i64 0
%6 = bitcast float* %5 to <4 x float>*		%6 = bitcast float* %5 to <4 x float>*
store <4 x float> %3, <4 x float>* %a1, align 16, !nontemporal !0		store <4 x float> %3, <4 x float>* %a1, align 16, !nontemporal !0
store <4 x float> %4, <4 x float>* %6, align 16, !nontemporal !0		store <4 x float> %4, <4 x float>* %6, align 16, !nontemporal !0
ret void		ret void
}		}

; FIXME: Nothing can perform NT-load-vector on 1-byte aligned memory.		; Nothing can perform NT-load-vector on 1-byte aligned memory.
; Just perform regular loads.		; Just perform regular loads.
define void @merge_2_v4f32_align1_ntload(<4 x float>* %a0, <4 x float>* %a1) nounwind {		define void @merge_2_v4f32_align1_ntload(<4 x float>* %a0, <4 x float>* %a1) nounwind {
; X86-LABEL: merge_2_v4f32_align1_ntload:		; X86-LABEL: merge_2_v4f32_align1_ntload:
; X86: # %bb.0:		; X86: # %bb.0:
; X86-NEXT: movl {{[0-9]+}}(%esp), %eax		; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
; X86-NEXT: movl {{[0-9]+}}(%esp), %ecx		; X86-NEXT: movl {{[0-9]+}}(%esp), %ecx
; X86-NEXT: movups (%ecx), %xmm0		; X86-NEXT: movups (%ecx), %xmm0
; X86-NEXT: movups 16(%ecx), %xmm1		; X86-NEXT: movups 16(%ecx), %xmm1
[... 21 lines not shown ...]
; X64-AVX-NEXT: retq
%4 = load <4 x float>, <4 x float>* %2, align 1, !nontemporal !0		%4 = load <4 x float>, <4 x float>* %2, align 1, !nontemporal !0
%5 = getelementptr inbounds <4 x float>, <4 x float>* %a1, i64 1, i64 0		%5 = getelementptr inbounds <4 x float>, <4 x float>* %a1, i64 1, i64 0
%6 = bitcast float* %5 to <4 x float>*		%6 = bitcast float* %5 to <4 x float>*
store <4 x float> %3, <4 x float>* %a1, align 1		store <4 x float> %3, <4 x float>* %a1, align 1
store <4 x float> %4, <4 x float>* %6, align 1		store <4 x float> %4, <4 x float>* %6, align 1
ret void		ret void
}		}

; FIXME: Nothing can perform NT-store-vector on 1-byte aligned memory.		; Nothing can perform NT-store-vector on 1-byte aligned memory.
; Must be scalarized to use MOVTNI/MOVNTSD.		; Must be scalarized to use MOVTNI/MOVNTSD.
define void @merge_2_v4f32_align1_ntstore(<4 x float>* %a0, <4 x float>* %a1) nounwind {		define void @merge_2_v4f32_align1_ntstore(<4 x float>* %a0, <4 x float>* %a1) nounwind {
; X86-LABEL: merge_2_v4f32_align1_ntstore:		; X86-LABEL: merge_2_v4f32_align1_ntstore:
; X86: # %bb.0:		; X86: # %bb.0:
; X86-NEXT: movl {{[0-9]+}}(%esp), %eax		; X86-NEXT: pushl %ebp
; X86-NEXT: movl {{[0-9]+}}(%esp), %ecx		; X86-NEXT: movl %esp, %ebp
		; X86-NEXT: andl $-16, %esp
		; X86-NEXT: subl $48, %esp
		; X86-NEXT: movl 12(%ebp), %eax
		; X86-NEXT: movl 8(%ebp), %ecx
; X86-NEXT: movups (%ecx), %xmm0		; X86-NEXT: movups (%ecx), %xmm0
; X86-NEXT: movups 16(%ecx), %xmm1		; X86-NEXT: movups 16(%ecx), %xmm1
; X86-NEXT: movups %xmm0, (%eax)		; X86-NEXT: movaps %xmm0, {{[0-9]+}}(%esp)
; X86-NEXT: movups %xmm1, 16(%eax)		; X86-NEXT: movl {{[0-9]+}}(%esp), %ecx
		; X86-NEXT: movntil %ecx, 12(%eax)
		; X86-NEXT: movl {{[0-9]+}}(%esp), %ecx
		; X86-NEXT: movntil %ecx, 8(%eax)
		; X86-NEXT: movl {{[0-9]+}}(%esp), %ecx
		; X86-NEXT: movl {{[0-9]+}}(%esp), %edx
		; X86-NEXT: movntil %edx, 4(%eax)
		; X86-NEXT: movntil %ecx, (%eax)
		; X86-NEXT: movaps %xmm1, (%esp)
		; X86-NEXT: movl {{[0-9]+}}(%esp), %ecx
		; X86-NEXT: movntil %ecx, 28(%eax)
		; X86-NEXT: movl {{[0-9]+}}(%esp), %ecx
		; X86-NEXT: movntil %ecx, 24(%eax)
		; X86-NEXT: movl (%esp), %ecx
		; X86-NEXT: movl {{[0-9]+}}(%esp), %edx
		; X86-NEXT: movntil %edx, 20(%eax)
		; X86-NEXT: movntil %ecx, 16(%eax)
		; X86-NEXT: movl %ebp, %esp
		; X86-NEXT: popl %ebp
; X86-NEXT: retl		; X86-NEXT: retl
;		;
; X64-SSE-LABEL: merge_2_v4f32_align1_ntstore:		; X64-SSE-LABEL: merge_2_v4f32_align1_ntstore:
; X64-SSE: # %bb.0:		; X64-SSE: # %bb.0:
; X64-SSE-NEXT: movups (%rdi), %xmm0		; X64-SSE-NEXT: movups (%rdi), %xmm0
; X64-SSE-NEXT: movups 16(%rdi), %xmm1		; X64-SSE-NEXT: movups 16(%rdi), %xmm1
; X64-SSE-NEXT: movups %xmm0, (%rsi)		; X64-SSE-NEXT: movaps %xmm0, -{{[0-9]+}}(%rsp)
; X64-SSE-NEXT: movups %xmm1, 16(%rsi)		; X64-SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rax
		; X64-SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rcx
		; X64-SSE-NEXT: movntiq %rcx, 8(%rsi)
		; X64-SSE-NEXT: movntiq %rax, (%rsi)
		; X64-SSE-NEXT: movaps %xmm1, -{{[0-9]+}}(%rsp)
		; X64-SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rax
		; X64-SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rcx
		; X64-SSE-NEXT: movntiq %rcx, 24(%rsi)
		; X64-SSE-NEXT: movntiq %rax, 16(%rsi)
; X64-SSE-NEXT: retq		; X64-SSE-NEXT: retq
;		;
; X64-AVX-LABEL: merge_2_v4f32_align1_ntstore:		; X64-AVX-LABEL: merge_2_v4f32_align1_ntstore:
; X64-AVX: # %bb.0:		; X64-AVX: # %bb.0:
; X64-AVX-NEXT: vmovups (%rdi), %ymm0		; X64-AVX-NEXT: vmovups (%rdi), %xmm0
; X64-AVX-NEXT: vmovups %ymm0, (%rsi)		; X64-AVX-NEXT: vmovups 16(%rdi), %xmm1
; X64-AVX-NEXT: vzeroupper		; X64-AVX-NEXT: vmovaps %xmm0, -{{[0-9]+}}(%rsp)
		; X64-AVX-NEXT: movq -{{[0-9]+}}(%rsp), %rax
		; X64-AVX-NEXT: movq -{{[0-9]+}}(%rsp), %rcx
		; X64-AVX-NEXT: movntiq %rcx, 8(%rsi)
		; X64-AVX-NEXT: movntiq %rax, (%rsi)
		; X64-AVX-NEXT: vmovaps %xmm1, -{{[0-9]+}}(%rsp)
		; X64-AVX-NEXT: movq -{{[0-9]+}}(%rsp), %rax
		; X64-AVX-NEXT: movq -{{[0-9]+}}(%rsp), %rcx
		; X64-AVX-NEXT: movntiq %rcx, 24(%rsi)
		; X64-AVX-NEXT: movntiq %rax, 16(%rsi)
; X64-AVX-NEXT: retq		; X64-AVX-NEXT: retq
%1 = getelementptr inbounds <4 x float>, <4 x float>* %a0, i64 1, i64 0		%1 = getelementptr inbounds <4 x float>, <4 x float>* %a0, i64 1, i64 0
%2 = bitcast float* %1 to <4 x float>*		%2 = bitcast float* %1 to <4 x float>*
%3 = load <4 x float>, <4 x float>* %a0, align 1		%3 = load <4 x float>, <4 x float>* %a0, align 1
%4 = load <4 x float>, <4 x float>* %2, align 1		%4 = load <4 x float>, <4 x float>* %2, align 1
%5 = getelementptr inbounds <4 x float>, <4 x float>* %a1, i64 1, i64 0		%5 = getelementptr inbounds <4 x float>, <4 x float>* %a1, i64 1, i64 0
%6 = bitcast float* %5 to <4 x float>*		%6 = bitcast float* %5 to <4 x float>*
store <4 x float> %3, <4 x float>* %a1, align 1, !nontemporal !0		store <4 x float> %3, <4 x float>* %a1, align 1, !nontemporal !0
store <4 x float> %4, <4 x float>* %6, align 1, !nontemporal !0		store <4 x float> %4, <4 x float>* %6, align 1, !nontemporal !0
ret void		ret void
}		}

; FIXME: Nothing can perform NT-load-vector on 1-byte aligned memory.		; Nothing can perform NT-load-vector on 1-byte aligned memory.
; Just perform regular loads and scalarize NT-stores.		; Just perform regular loads and scalarize NT-stores.
define void @merge_2_v4f32_align1(<4 x float>* %a0, <4 x float>* %a1) nounwind {		define void @merge_2_v4f32_align1(<4 x float>* %a0, <4 x float>* %a1) nounwind {
; X86-LABEL: merge_2_v4f32_align1:		; X86-LABEL: merge_2_v4f32_align1:
; X86: # %bb.0:		; X86: # %bb.0:
; X86-NEXT: movl {{[0-9]+}}(%esp), %eax		; X86-NEXT: pushl %ebp
; X86-NEXT: movl {{[0-9]+}}(%esp), %ecx		; X86-NEXT: movl %esp, %ebp
		; X86-NEXT: andl $-16, %esp
		; X86-NEXT: subl $48, %esp
		; X86-NEXT: movl 12(%ebp), %eax
		; X86-NEXT: movl 8(%ebp), %ecx
; X86-NEXT: movups (%ecx), %xmm0		; X86-NEXT: movups (%ecx), %xmm0
; X86-NEXT: movups 16(%ecx), %xmm1		; X86-NEXT: movups 16(%ecx), %xmm1
; X86-NEXT: movups %xmm0, (%eax)		; X86-NEXT: movaps %xmm0, {{[0-9]+}}(%esp)
; X86-NEXT: movups %xmm1, 16(%eax)		; X86-NEXT: movl {{[0-9]+}}(%esp), %ecx
		; X86-NEXT: movntil %ecx, 12(%eax)
		; X86-NEXT: movl {{[0-9]+}}(%esp), %ecx
		; X86-NEXT: movntil %ecx, 8(%eax)
		; X86-NEXT: movl {{[0-9]+}}(%esp), %ecx
		; X86-NEXT: movl {{[0-9]+}}(%esp), %edx
		; X86-NEXT: movntil %edx, 4(%eax)
		; X86-NEXT: movntil %ecx, (%eax)
		; X86-NEXT: movaps %xmm1, (%esp)
		; X86-NEXT: movl {{[0-9]+}}(%esp), %ecx
		; X86-NEXT: movntil %ecx, 28(%eax)
		; X86-NEXT: movl {{[0-9]+}}(%esp), %ecx
		; X86-NEXT: movntil %ecx, 24(%eax)
		; X86-NEXT: movl (%esp), %ecx
		; X86-NEXT: movl {{[0-9]+}}(%esp), %edx
		; X86-NEXT: movntil %edx, 20(%eax)
		; X86-NEXT: movntil %ecx, 16(%eax)
		; X86-NEXT: movl %ebp, %esp
		; X86-NEXT: popl %ebp
; X86-NEXT: retl		; X86-NEXT: retl
;		;
; X64-SSE-LABEL: merge_2_v4f32_align1:		; X64-SSE-LABEL: merge_2_v4f32_align1:
; X64-SSE: # %bb.0:		; X64-SSE: # %bb.0:
; X64-SSE-NEXT: movups (%rdi), %xmm0		; X64-SSE-NEXT: movups (%rdi), %xmm0
; X64-SSE-NEXT: movups 16(%rdi), %xmm1		; X64-SSE-NEXT: movups 16(%rdi), %xmm1
; X64-SSE-NEXT: movups %xmm0, (%rsi)		; X64-SSE-NEXT: movaps %xmm0, -{{[0-9]+}}(%rsp)
; X64-SSE-NEXT: movups %xmm1, 16(%rsi)		; X64-SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rax
		; X64-SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rcx
		; X64-SSE-NEXT: movntiq %rcx, 8(%rsi)
		; X64-SSE-NEXT: movntiq %rax, (%rsi)
		; X64-SSE-NEXT: movaps %xmm1, -{{[0-9]+}}(%rsp)
		; X64-SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rax
		; X64-SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rcx
		; X64-SSE-NEXT: movntiq %rcx, 24(%rsi)
		; X64-SSE-NEXT: movntiq %rax, 16(%rsi)
; X64-SSE-NEXT: retq		; X64-SSE-NEXT: retq
;		;
; X64-AVX-LABEL: merge_2_v4f32_align1:		; X64-AVX-LABEL: merge_2_v4f32_align1:
; X64-AVX: # %bb.0:		; X64-AVX: # %bb.0:
; X64-AVX-NEXT: vmovups (%rdi), %ymm0		; X64-AVX-NEXT: vmovups (%rdi), %xmm0
; X64-AVX-NEXT: vmovups %ymm0, (%rsi)		; X64-AVX-NEXT: vmovups 16(%rdi), %xmm1
; X64-AVX-NEXT: vzeroupper		; X64-AVX-NEXT: vmovaps %xmm0, -{{[0-9]+}}(%rsp)
		; X64-AVX-NEXT: movq -{{[0-9]+}}(%rsp), %rax
		; X64-AVX-NEXT: movq -{{[0-9]+}}(%rsp), %rcx
		; X64-AVX-NEXT: movntiq %rcx, 8(%rsi)
		; X64-AVX-NEXT: movntiq %rax, (%rsi)
		; X64-AVX-NEXT: vmovaps %xmm1, -{{[0-9]+}}(%rsp)
		; X64-AVX-NEXT: movq -{{[0-9]+}}(%rsp), %rax
		; X64-AVX-NEXT: movq -{{[0-9]+}}(%rsp), %rcx
		; X64-AVX-NEXT: movntiq %rcx, 24(%rsi)
		; X64-AVX-NEXT: movntiq %rax, 16(%rsi)
; X64-AVX-NEXT: retq		; X64-AVX-NEXT: retq
%1 = getelementptr inbounds <4 x float>, <4 x float>* %a0, i64 1, i64 0		%1 = getelementptr inbounds <4 x float>, <4 x float>* %a0, i64 1, i64 0
%2 = bitcast float* %1 to <4 x float>*		%2 = bitcast float* %1 to <4 x float>*
%3 = load <4 x float>, <4 x float>* %a0, align 1, !nontemporal !0		%3 = load <4 x float>, <4 x float>* %a0, align 1, !nontemporal !0
%4 = load <4 x float>, <4 x float>* %2, align 1, !nontemporal !0		%4 = load <4 x float>, <4 x float>* %2, align 1, !nontemporal !0
%5 = getelementptr inbounds <4 x float>, <4 x float>* %a1, i64 1, i64 0		%5 = getelementptr inbounds <4 x float>, <4 x float>* %a1, i64 1, i64 0
%6 = bitcast float* %5 to <4 x float>*		%6 = bitcast float* %5 to <4 x float>*
store <4 x float> %3, <4 x float>* %a1, align 1, !nontemporal !0		store <4 x float> %3, <4 x float>* %a1, align 1, !nontemporal !0
store <4 x float> %4, <4 x float>* %6, align 1, !nontemporal !0		store <4 x float> %4, <4 x float>* %6, align 1, !nontemporal !0
ret void		ret void
}		}

!0 = !{i32 1}		!0 = !{i32 1}

test/CodeGen/X86/nontemporal-3.ll

	[... 9 lines not shown ...]
	; Test codegen for under aligned nontemporal vector stores			; Test codegen for under aligned nontemporal vector stores

	; XMM versions.			; XMM versions.

	define void @test_zero_v2f64_align1(<2 x double>* %dst) nounwind {			define void @test_zero_v2f64_align1(<2 x double>* %dst) nounwind {
	; SSE-LABEL: test_zero_v2f64_align1:			; SSE-LABEL: test_zero_v2f64_align1:
	; SSE: # %bb.0:			; SSE: # %bb.0:
	; SSE-NEXT: xorps %xmm0, %xmm0			; SSE-NEXT: xorps %xmm0, %xmm0
	; SSE-NEXT: movups %xmm0, (%rdi)			; SSE-NEXT: movaps %xmm0, -{{[0-9]+}}(%rsp)
				; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rax
				; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rcx
				; SSE-NEXT: movntiq %rcx, 8(%rdi)
				; SSE-NEXT: movntiq %rax, (%rdi)
	; SSE-NEXT: retq			; SSE-NEXT: retq
	;			;
	; AVX-LABEL: test_zero_v2f64_align1:			; AVX-LABEL: test_zero_v2f64_align1:
	; AVX: # %bb.0:			; AVX: # %bb.0:
	; AVX-NEXT: vxorps %xmm0, %xmm0, %xmm0			; AVX-NEXT: vxorps %xmm0, %xmm0, %xmm0
	; AVX-NEXT: vmovups %xmm0, (%rdi)			; AVX-NEXT: vmovaps %xmm0, -{{[0-9]+}}(%rsp)
				; AVX-NEXT: movq -{{[0-9]+}}(%rsp), %rax
				; AVX-NEXT: movq -{{[0-9]+}}(%rsp), %rcx
				; AVX-NEXT: movntiq %rcx, 8(%rdi)
				; AVX-NEXT: movntiq %rax, (%rdi)
	; AVX-NEXT: retq			; AVX-NEXT: retq
	;			;
	; AVX512-LABEL: test_zero_v2f64_align1:			; AVX512-LABEL: test_zero_v2f64_align1:
	; AVX512: # %bb.0:			; AVX512: # %bb.0:
	; AVX512-NEXT: vxorps %xmm0, %xmm0, %xmm0			; AVX512-NEXT: vxorps %xmm0, %xmm0, %xmm0
	; AVX512-NEXT: vmovups %xmm0, (%rdi)			; AVX512-NEXT: vmovaps %xmm0, -{{[0-9]+}}(%rsp)
				; AVX512-NEXT: movq -{{[0-9]+}}(%rsp), %rax
				; AVX512-NEXT: movq -{{[0-9]+}}(%rsp), %rcx
				; AVX512-NEXT: movntiq %rcx, 8(%rdi)
				; AVX512-NEXT: movntiq %rax, (%rdi)
	; AVX512-NEXT: retq			; AVX512-NEXT: retq
	store <2 x double> zeroinitializer, <2 x double>* %dst, align 1, !nontemporal !1			store <2 x double> zeroinitializer, <2 x double>* %dst, align 1, !nontemporal !1
	ret void			ret void
	}			}

	define void @test_zero_v4f32_align1(<4 x float>* %dst) nounwind {			define void @test_zero_v4f32_align1(<4 x float>* %dst) nounwind {
	; SSE-LABEL: test_zero_v4f32_align1:			; SSE-LABEL: test_zero_v4f32_align1:
	; SSE: # %bb.0:			; SSE: # %bb.0:
	; SSE-NEXT: xorps %xmm0, %xmm0			; SSE-NEXT: xorps %xmm0, %xmm0
	; SSE-NEXT: movups %xmm0, (%rdi)			; SSE-NEXT: movaps %xmm0, -{{[0-9]+}}(%rsp)
				; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rax
				; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rcx
				; SSE-NEXT: movntiq %rcx, 8(%rdi)
				; SSE-NEXT: movntiq %rax, (%rdi)
	; SSE-NEXT: retq			; SSE-NEXT: retq
	;			;
	; AVX-LABEL: test_zero_v4f32_align1:			; AVX-LABEL: test_zero_v4f32_align1:
	; AVX: # %bb.0:			; AVX: # %bb.0:
	; AVX-NEXT: vxorps %xmm0, %xmm0, %xmm0			; AVX-NEXT: vxorps %xmm0, %xmm0, %xmm0
	; AVX-NEXT: vmovups %xmm0, (%rdi)			; AVX-NEXT: vmovaps %xmm0, -{{[0-9]+}}(%rsp)
				; AVX-NEXT: movq -{{[0-9]+}}(%rsp), %rax
				; AVX-NEXT: movq -{{[0-9]+}}(%rsp), %rcx
				; AVX-NEXT: movntiq %rcx, 8(%rdi)
				; AVX-NEXT: movntiq %rax, (%rdi)
	; AVX-NEXT: retq			; AVX-NEXT: retq
	;			;
	; AVX512-LABEL: test_zero_v4f32_align1:			; AVX512-LABEL: test_zero_v4f32_align1:
	; AVX512: # %bb.0:			; AVX512: # %bb.0:
	; AVX512-NEXT: vxorps %xmm0, %xmm0, %xmm0			; AVX512-NEXT: vxorps %xmm0, %xmm0, %xmm0
	; AVX512-NEXT: vmovups %xmm0, (%rdi)			; AVX512-NEXT: vmovaps %xmm0, -{{[0-9]+}}(%rsp)
				; AVX512-NEXT: movq -{{[0-9]+}}(%rsp), %rax
				; AVX512-NEXT: movq -{{[0-9]+}}(%rsp), %rcx
				; AVX512-NEXT: movntiq %rcx, 8(%rdi)
				; AVX512-NEXT: movntiq %rax, (%rdi)
	; AVX512-NEXT: retq			; AVX512-NEXT: retq
	store <4 x float> zeroinitializer, <4 x float>* %dst, align 1, !nontemporal !1			store <4 x float> zeroinitializer, <4 x float>* %dst, align 1, !nontemporal !1
	ret void			ret void
	}			}

	define void @test_zero_v2i64_align1(<2 x i64>* %dst) nounwind {			define void @test_zero_v2i64_align1(<2 x i64>* %dst) nounwind {
	; SSE-LABEL: test_zero_v2i64_align1:			; SSE-LABEL: test_zero_v2i64_align1:
	; SSE: # %bb.0:			; SSE: # %bb.0:
	; SSE-NEXT: xorps %xmm0, %xmm0			; SSE-NEXT: xorps %xmm0, %xmm0
	; SSE-NEXT: movups %xmm0, (%rdi)			; SSE-NEXT: movaps %xmm0, -{{[0-9]+}}(%rsp)
				; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rax
				; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rcx
				; SSE-NEXT: movntiq %rcx, 8(%rdi)
				; SSE-NEXT: movntiq %rax, (%rdi)
	; SSE-NEXT: retq			; SSE-NEXT: retq
	;			;
	; AVX-LABEL: test_zero_v2i64_align1:			; AVX-LABEL: test_zero_v2i64_align1:
	; AVX: # %bb.0:			; AVX: # %bb.0:
	; AVX-NEXT: vxorps %xmm0, %xmm0, %xmm0			; AVX-NEXT: vxorps %xmm0, %xmm0, %xmm0
	; AVX-NEXT: vmovups %xmm0, (%rdi)			; AVX-NEXT: vmovaps %xmm0, -{{[0-9]+}}(%rsp)
				; AVX-NEXT: movq -{{[0-9]+}}(%rsp), %rax
				; AVX-NEXT: movq -{{[0-9]+}}(%rsp), %rcx
				; AVX-NEXT: movntiq %rcx, 8(%rdi)
				; AVX-NEXT: movntiq %rax, (%rdi)
	; AVX-NEXT: retq			; AVX-NEXT: retq
	;			;
	; AVX512-LABEL: test_zero_v2i64_align1:			; AVX512-LABEL: test_zero_v2i64_align1:
	; AVX512: # %bb.0:			; AVX512: # %bb.0:
	; AVX512-NEXT: vxorps %xmm0, %xmm0, %xmm0			; AVX512-NEXT: vxorps %xmm0, %xmm0, %xmm0
	; AVX512-NEXT: vmovups %xmm0, (%rdi)			; AVX512-NEXT: vmovaps %xmm0, -{{[0-9]+}}(%rsp)
				; AVX512-NEXT: movq -{{[0-9]+}}(%rsp), %rax
				; AVX512-NEXT: movq -{{[0-9]+}}(%rsp), %rcx
				; AVX512-NEXT: movntiq %rcx, 8(%rdi)
				; AVX512-NEXT: movntiq %rax, (%rdi)
	; AVX512-NEXT: retq			; AVX512-NEXT: retq
	store <2 x i64> zeroinitializer, <2 x i64>* %dst, align 1, !nontemporal !1			store <2 x i64> zeroinitializer, <2 x i64>* %dst, align 1, !nontemporal !1
	ret void			ret void
	}			}

	define void @test_zero_v4i32_align1(<4 x i32>* %dst) nounwind {			define void @test_zero_v4i32_align1(<4 x i32>* %dst) nounwind {
	; SSE-LABEL: test_zero_v4i32_align1:			; SSE-LABEL: test_zero_v4i32_align1:
	; SSE: # %bb.0:			; SSE: # %bb.0:
	; SSE-NEXT: xorps %xmm0, %xmm0			; SSE-NEXT: xorps %xmm0, %xmm0
	; SSE-NEXT: movups %xmm0, (%rdi)			; SSE-NEXT: movaps %xmm0, -{{[0-9]+}}(%rsp)
				; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rax
				; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rcx
				; SSE-NEXT: movntiq %rcx, 8(%rdi)
				; SSE-NEXT: movntiq %rax, (%rdi)
	; SSE-NEXT: retq			; SSE-NEXT: retq
	;			;
	; AVX-LABEL: test_zero_v4i32_align1:			; AVX-LABEL: test_zero_v4i32_align1:
	; AVX: # %bb.0:			; AVX: # %bb.0:
	; AVX-NEXT: vxorps %xmm0, %xmm0, %xmm0			; AVX-NEXT: vxorps %xmm0, %xmm0, %xmm0
	; AVX-NEXT: vmovups %xmm0, (%rdi)			; AVX-NEXT: vmovaps %xmm0, -{{[0-9]+}}(%rsp)
				; AVX-NEXT: movq -{{[0-9]+}}(%rsp), %rax
				; AVX-NEXT: movq -{{[0-9]+}}(%rsp), %rcx
				; AVX-NEXT: movntiq %rcx, 8(%rdi)
				; AVX-NEXT: movntiq %rax, (%rdi)
	; AVX-NEXT: retq			; AVX-NEXT: retq
	;			;
	; AVX512-LABEL: test_zero_v4i32_align1:			; AVX512-LABEL: test_zero_v4i32_align1:
	; AVX512: # %bb.0:			; AVX512: # %bb.0:
	; AVX512-NEXT: vxorps %xmm0, %xmm0, %xmm0			; AVX512-NEXT: vxorps %xmm0, %xmm0, %xmm0
	; AVX512-NEXT: vmovups %xmm0, (%rdi)			; AVX512-NEXT: vmovaps %xmm0, -{{[0-9]+}}(%rsp)
				; AVX512-NEXT: movq -{{[0-9]+}}(%rsp), %rax
				; AVX512-NEXT: movq -{{[0-9]+}}(%rsp), %rcx
				; AVX512-NEXT: movntiq %rcx, 8(%rdi)
				; AVX512-NEXT: movntiq %rax, (%rdi)
	; AVX512-NEXT: retq			; AVX512-NEXT: retq
	store <4 x i32> zeroinitializer, <4 x i32>* %dst, align 1, !nontemporal !1			store <4 x i32> zeroinitializer, <4 x i32>* %dst, align 1, !nontemporal !1
	ret void			ret void
	}			}

	define void @test_zero_v8i16_align1(<8 x i16>* %dst) nounwind {			define void @test_zero_v8i16_align1(<8 x i16>* %dst) nounwind {
	; SSE-LABEL: test_zero_v8i16_align1:			; SSE-LABEL: test_zero_v8i16_align1:
	; SSE: # %bb.0:			; SSE: # %bb.0:
	; SSE-NEXT: xorps %xmm0, %xmm0			; SSE-NEXT: xorps %xmm0, %xmm0
	; SSE-NEXT: movups %xmm0, (%rdi)			; SSE-NEXT: movaps %xmm0, -{{[0-9]+}}(%rsp)
				; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rax
				; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rcx
				; SSE-NEXT: movntiq %rcx, 8(%rdi)
				; SSE-NEXT: movntiq %rax, (%rdi)
	; SSE-NEXT: retq			; SSE-NEXT: retq
	;			;
	; AVX-LABEL: test_zero_v8i16_align1:			; AVX-LABEL: test_zero_v8i16_align1:
	; AVX: # %bb.0:			; AVX: # %bb.0:
	; AVX-NEXT: vxorps %xmm0, %xmm0, %xmm0			; AVX-NEXT: vxorps %xmm0, %xmm0, %xmm0
	; AVX-NEXT: vmovups %xmm0, (%rdi)			; AVX-NEXT: vmovaps %xmm0, -{{[0-9]+}}(%rsp)
				; AVX-NEXT: movq -{{[0-9]+}}(%rsp), %rax
				; AVX-NEXT: movq -{{[0-9]+}}(%rsp), %rcx
				; AVX-NEXT: movntiq %rcx, 8(%rdi)
				; AVX-NEXT: movntiq %rax, (%rdi)
	; AVX-NEXT: retq			; AVX-NEXT: retq
	;			;
	; AVX512-LABEL: test_zero_v8i16_align1:			; AVX512-LABEL: test_zero_v8i16_align1:
	; AVX512: # %bb.0:			; AVX512: # %bb.0:
	; AVX512-NEXT: vxorps %xmm0, %xmm0, %xmm0			; AVX512-NEXT: vxorps %xmm0, %xmm0, %xmm0
	; AVX512-NEXT: vmovups %xmm0, (%rdi)			; AVX512-NEXT: vmovaps %xmm0, -{{[0-9]+}}(%rsp)
				; AVX512-NEXT: movq -{{[0-9]+}}(%rsp), %rax
				; AVX512-NEXT: movq -{{[0-9]+}}(%rsp), %rcx
				; AVX512-NEXT: movntiq %rcx, 8(%rdi)
				; AVX512-NEXT: movntiq %rax, (%rdi)
	; AVX512-NEXT: retq			; AVX512-NEXT: retq
	store <8 x i16> zeroinitializer, <8 x i16>* %dst, align 1, !nontemporal !1			store <8 x i16> zeroinitializer, <8 x i16>* %dst, align 1, !nontemporal !1
	ret void			ret void
	}			}

	define void @test_zero_v16i8_align1(<16 x i8>* %dst) nounwind {			define void @test_zero_v16i8_align1(<16 x i8>* %dst) nounwind {
	; SSE-LABEL: test_zero_v16i8_align1:			; SSE-LABEL: test_zero_v16i8_align1:
	; SSE: # %bb.0:			; SSE: # %bb.0:
	; SSE-NEXT: xorps %xmm0, %xmm0			; SSE-NEXT: xorps %xmm0, %xmm0
	; SSE-NEXT: movups %xmm0, (%rdi)			; SSE-NEXT: movaps %xmm0, -{{[0-9]+}}(%rsp)
				; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rax
				; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rcx
				; SSE-NEXT: movntiq %rcx, 8(%rdi)
				; SSE-NEXT: movntiq %rax, (%rdi)
	; SSE-NEXT: retq			; SSE-NEXT: retq
	;			;
	; AVX-LABEL: test_zero_v16i8_align1:			; AVX-LABEL: test_zero_v16i8_align1:
	; AVX: # %bb.0:			; AVX: # %bb.0:
	; AVX-NEXT: vxorps %xmm0, %xmm0, %xmm0			; AVX-NEXT: vxorps %xmm0, %xmm0, %xmm0
	; AVX-NEXT: vmovups %xmm0, (%rdi)			; AVX-NEXT: vmovaps %xmm0, -{{[0-9]+}}(%rsp)
				; AVX-NEXT: movq -{{[0-9]+}}(%rsp), %rax
				; AVX-NEXT: movq -{{[0-9]+}}(%rsp), %rcx
				; AVX-NEXT: movntiq %rcx, 8(%rdi)
				; AVX-NEXT: movntiq %rax, (%rdi)
	; AVX-NEXT: retq			; AVX-NEXT: retq
	;			;
	; AVX512-LABEL: test_zero_v16i8_align1:			; AVX512-LABEL: test_zero_v16i8_align1:
	; AVX512: # %bb.0:			; AVX512: # %bb.0:
	; AVX512-NEXT: vxorps %xmm0, %xmm0, %xmm0			; AVX512-NEXT: vxorps %xmm0, %xmm0, %xmm0
	; AVX512-NEXT: vmovups %xmm0, (%rdi)			; AVX512-NEXT: vmovaps %xmm0, -{{[0-9]+}}(%rsp)
				; AVX512-NEXT: movq -{{[0-9]+}}(%rsp), %rax
				; AVX512-NEXT: movq -{{[0-9]+}}(%rsp), %rcx
				; AVX512-NEXT: movntiq %rcx, 8(%rdi)
				; AVX512-NEXT: movntiq %rax, (%rdi)
	; AVX512-NEXT: retq			; AVX512-NEXT: retq
	store <16 x i8> zeroinitializer, <16 x i8>* %dst, align 1, !nontemporal !1			store <16 x i8> zeroinitializer, <16 x i8>* %dst, align 1, !nontemporal !1
	ret void			ret void
	}			}

	; YMM versions.			; YMM versions.

	define void @test_zero_v4f64_align1(<4 x double>* %dst) nounwind {			define void @test_zero_v4f64_align1(<4 x double>* %dst) nounwind {
	; SSE-LABEL: test_zero_v4f64_align1:			; SSE-LABEL: test_zero_v4f64_align1:
	; SSE: # %bb.0:			; SSE: # %bb.0:
	; SSE-NEXT: xorps %xmm0, %xmm0			; SSE-NEXT: xorps %xmm0, %xmm0
	; SSE-NEXT: movups %xmm0, 16(%rdi)			; SSE-NEXT: movaps %xmm0, -{{[0-9]+}}(%rsp)
	; SSE-NEXT: movups %xmm0, (%rdi)			; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rax
				; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rcx
				; SSE-NEXT: movntiq %rcx, 24(%rdi)
				; SSE-NEXT: movntiq %rax, 16(%rdi)
				; SSE-NEXT: movaps %xmm0, -{{[0-9]+}}(%rsp)
				; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rax
				; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rcx
				; SSE-NEXT: movntiq %rcx, 8(%rdi)
				; SSE-NEXT: movntiq %rax, (%rdi)
	; SSE-NEXT: retq			; SSE-NEXT: retq
	;			;
	; AVX-LABEL: test_zero_v4f64_align1:			; AVX-LABEL: test_zero_v4f64_align1:
	; AVX: # %bb.0:			; AVX: # %bb.0:
	; AVX-NEXT: vxorps %xmm0, %xmm0, %xmm0			; AVX-NEXT: vxorps %xmm0, %xmm0, %xmm0
	; AVX-NEXT: vmovups %ymm0, (%rdi)			; AVX-NEXT: vmovups %ymm0, (%rdi)
	; AVX-NEXT: vzeroupper			; AVX-NEXT: vzeroupper
	; AVX-NEXT: retq			; AVX-NEXT: retq
	;			;
	; AVX512-LABEL: test_zero_v4f64_align1:			; AVX512-LABEL: test_zero_v4f64_align1:
	; AVX512: # %bb.0:			; AVX512: # %bb.0:
	; AVX512-NEXT: vxorps %xmm0, %xmm0, %xmm0			; AVX512-NEXT: vxorps %xmm0, %xmm0, %xmm0
	; AVX512-NEXT: vmovups %ymm0, (%rdi)			; AVX512-NEXT: vmovups %ymm0, (%rdi)
	; AVX512-NEXT: vzeroupper			; AVX512-NEXT: vzeroupper
	; AVX512-NEXT: retq			; AVX512-NEXT: retq
	store <4 x double> zeroinitializer, <4 x double>* %dst, align 1, !nontemporal !1			store <4 x double> zeroinitializer, <4 x double>* %dst, align 1, !nontemporal !1
	ret void			ret void
	}			}

	define void @test_zero_v8f32_align1(<8 x float>* %dst) nounwind {			define void @test_zero_v8f32_align1(<8 x float>* %dst) nounwind {
	; SSE-LABEL: test_zero_v8f32_align1:			; SSE-LABEL: test_zero_v8f32_align1:
	; SSE: # %bb.0:			; SSE: # %bb.0:
	; SSE-NEXT: xorps %xmm0, %xmm0			; SSE-NEXT: xorps %xmm0, %xmm0
	; SSE-NEXT: movups %xmm0, 16(%rdi)			; SSE-NEXT: movaps %xmm0, -{{[0-9]+}}(%rsp)
	; SSE-NEXT: movups %xmm0, (%rdi)			; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rax
				; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rcx
				; SSE-NEXT: movntiq %rcx, 24(%rdi)
				; SSE-NEXT: movntiq %rax, 16(%rdi)
				; SSE-NEXT: movaps %xmm0, -{{[0-9]+}}(%rsp)
				; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rax
				; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rcx
				; SSE-NEXT: movntiq %rcx, 8(%rdi)
				; SSE-NEXT: movntiq %rax, (%rdi)
	; SSE-NEXT: retq			; SSE-NEXT: retq
	;			;
	; AVX-LABEL: test_zero_v8f32_align1:			; AVX-LABEL: test_zero_v8f32_align1:
	; AVX: # %bb.0:			; AVX: # %bb.0:
	; AVX-NEXT: vxorps %xmm0, %xmm0, %xmm0			; AVX-NEXT: vxorps %xmm0, %xmm0, %xmm0
	; AVX-NEXT: vmovups %ymm0, (%rdi)			; AVX-NEXT: vmovups %ymm0, (%rdi)
	; AVX-NEXT: vzeroupper			; AVX-NEXT: vzeroupper
	; AVX-NEXT: retq			; AVX-NEXT: retq
	;			;
	; AVX512-LABEL: test_zero_v8f32_align1:			; AVX512-LABEL: test_zero_v8f32_align1:
	; AVX512: # %bb.0:			; AVX512: # %bb.0:
	; AVX512-NEXT: vxorps %xmm0, %xmm0, %xmm0			; AVX512-NEXT: vxorps %xmm0, %xmm0, %xmm0
	; AVX512-NEXT: vmovups %ymm0, (%rdi)			; AVX512-NEXT: vmovups %ymm0, (%rdi)
	; AVX512-NEXT: vzeroupper			; AVX512-NEXT: vzeroupper
	; AVX512-NEXT: retq			; AVX512-NEXT: retq
	store <8 x float> zeroinitializer, <8 x float>* %dst, align 1, !nontemporal !1			store <8 x float> zeroinitializer, <8 x float>* %dst, align 1, !nontemporal !1
	ret void			ret void
	}			}

	define void @test_zero_v4i64_align1(<4 x i64>* %dst) nounwind {			define void @test_zero_v4i64_align1(<4 x i64>* %dst) nounwind {
	; SSE-LABEL: test_zero_v4i64_align1:			; SSE-LABEL: test_zero_v4i64_align1:
	; SSE: # %bb.0:			; SSE: # %bb.0:
	; SSE-NEXT: xorps %xmm0, %xmm0			; SSE-NEXT: xorps %xmm0, %xmm0
	; SSE-NEXT: movups %xmm0, 16(%rdi)			; SSE-NEXT: movaps %xmm0, -{{[0-9]+}}(%rsp)
	; SSE-NEXT: movups %xmm0, (%rdi)			; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rax
				; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rcx
				; SSE-NEXT: movntiq %rcx, 24(%rdi)
				; SSE-NEXT: movntiq %rax, 16(%rdi)
				; SSE-NEXT: movaps %xmm0, -{{[0-9]+}}(%rsp)
				; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rax
				; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rcx
				; SSE-NEXT: movntiq %rcx, 8(%rdi)
				; SSE-NEXT: movntiq %rax, (%rdi)
	; SSE-NEXT: retq			; SSE-NEXT: retq
	;			;
	; AVX-LABEL: test_zero_v4i64_align1:			; AVX-LABEL: test_zero_v4i64_align1:
	; AVX: # %bb.0:			; AVX: # %bb.0:
	; AVX-NEXT: vxorps %xmm0, %xmm0, %xmm0			; AVX-NEXT: vxorps %xmm0, %xmm0, %xmm0
	; AVX-NEXT: vmovups %ymm0, (%rdi)			; AVX-NEXT: vmovups %ymm0, (%rdi)
	; AVX-NEXT: vzeroupper			; AVX-NEXT: vzeroupper
	; AVX-NEXT: retq			; AVX-NEXT: retq
	;			;
	; AVX512-LABEL: test_zero_v4i64_align1:			; AVX512-LABEL: test_zero_v4i64_align1:
	; AVX512: # %bb.0:			; AVX512: # %bb.0:
	; AVX512-NEXT: vxorps %xmm0, %xmm0, %xmm0			; AVX512-NEXT: vxorps %xmm0, %xmm0, %xmm0
	; AVX512-NEXT: vmovups %ymm0, (%rdi)			; AVX512-NEXT: vmovups %ymm0, (%rdi)
	; AVX512-NEXT: vzeroupper			; AVX512-NEXT: vzeroupper
	; AVX512-NEXT: retq			; AVX512-NEXT: retq
	store <4 x i64> zeroinitializer, <4 x i64>* %dst, align 1, !nontemporal !1			store <4 x i64> zeroinitializer, <4 x i64>* %dst, align 1, !nontemporal !1
	ret void			ret void
	}			}

	define void @test_zero_v8i32_align1(<8 x i32>* %dst) nounwind {			define void @test_zero_v8i32_align1(<8 x i32>* %dst) nounwind {
	; SSE-LABEL: test_zero_v8i32_align1:			; SSE-LABEL: test_zero_v8i32_align1:
	; SSE: # %bb.0:			; SSE: # %bb.0:
	; SSE-NEXT: xorps %xmm0, %xmm0			; SSE-NEXT: xorps %xmm0, %xmm0
	; SSE-NEXT: movups %xmm0, 16(%rdi)			; SSE-NEXT: movaps %xmm0, -{{[0-9]+}}(%rsp)
	; SSE-NEXT: movups %xmm0, (%rdi)			; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rax
				; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rcx
				; SSE-NEXT: movntiq %rcx, 24(%rdi)
				; SSE-NEXT: movntiq %rax, 16(%rdi)
				; SSE-NEXT: movaps %xmm0, -{{[0-9]+}}(%rsp)
				; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rax
				; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rcx
				; SSE-NEXT: movntiq %rcx, 8(%rdi)
				; SSE-NEXT: movntiq %rax, (%rdi)
	; SSE-NEXT: retq			; SSE-NEXT: retq
	;			;
	; AVX-LABEL: test_zero_v8i32_align1:			; AVX-LABEL: test_zero_v8i32_align1:
	; AVX: # %bb.0:			; AVX: # %bb.0:
	; AVX-NEXT: vxorps %xmm0, %xmm0, %xmm0			; AVX-NEXT: vxorps %xmm0, %xmm0, %xmm0
	; AVX-NEXT: vmovups %ymm0, (%rdi)			; AVX-NEXT: vmovups %ymm0, (%rdi)
	; AVX-NEXT: vzeroupper			; AVX-NEXT: vzeroupper
	; AVX-NEXT: retq			; AVX-NEXT: retq
	;			;
	; AVX512-LABEL: test_zero_v8i32_align1:			; AVX512-LABEL: test_zero_v8i32_align1:
	; AVX512: # %bb.0:			; AVX512: # %bb.0:
	; AVX512-NEXT: vxorps %xmm0, %xmm0, %xmm0			; AVX512-NEXT: vxorps %xmm0, %xmm0, %xmm0
	; AVX512-NEXT: vmovups %ymm0, (%rdi)			; AVX512-NEXT: vmovups %ymm0, (%rdi)
	; AVX512-NEXT: vzeroupper			; AVX512-NEXT: vzeroupper
	; AVX512-NEXT: retq			; AVX512-NEXT: retq
	store <8 x i32> zeroinitializer, <8 x i32>* %dst, align 1, !nontemporal !1			store <8 x i32> zeroinitializer, <8 x i32>* %dst, align 1, !nontemporal !1
	ret void			ret void
	}			}

	define void @test_zero_v16i16_align1(<16 x i16>* %dst) nounwind {			define void @test_zero_v16i16_align1(<16 x i16>* %dst) nounwind {
	; SSE-LABEL: test_zero_v16i16_align1:			; SSE-LABEL: test_zero_v16i16_align1:
	; SSE: # %bb.0:			; SSE: # %bb.0:
	; SSE-NEXT: xorps %xmm0, %xmm0			; SSE-NEXT: xorps %xmm0, %xmm0
	; SSE-NEXT: movups %xmm0, 16(%rdi)			; SSE-NEXT: movaps %xmm0, -{{[0-9]+}}(%rsp)
	; SSE-NEXT: movups %xmm0, (%rdi)			; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rax
				; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rcx
				; SSE-NEXT: movntiq %rcx, 24(%rdi)
				; SSE-NEXT: movntiq %rax, 16(%rdi)
				; SSE-NEXT: movaps %xmm0, -{{[0-9]+}}(%rsp)
				; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rax
				; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rcx
				; SSE-NEXT: movntiq %rcx, 8(%rdi)
				; SSE-NEXT: movntiq %rax, (%rdi)
	; SSE-NEXT: retq			; SSE-NEXT: retq
	;			;
	; AVX-LABEL: test_zero_v16i16_align1:			; AVX-LABEL: test_zero_v16i16_align1:
	; AVX: # %bb.0:			; AVX: # %bb.0:
	; AVX-NEXT: vxorps %xmm0, %xmm0, %xmm0			; AVX-NEXT: vxorps %xmm0, %xmm0, %xmm0
	; AVX-NEXT: vmovups %ymm0, (%rdi)			; AVX-NEXT: vmovups %ymm0, (%rdi)
	; AVX-NEXT: vzeroupper			; AVX-NEXT: vzeroupper
	; AVX-NEXT: retq			; AVX-NEXT: retq
	;			;
	; AVX512-LABEL: test_zero_v16i16_align1:			; AVX512-LABEL: test_zero_v16i16_align1:
	; AVX512: # %bb.0:			; AVX512: # %bb.0:
	; AVX512-NEXT: vxorps %xmm0, %xmm0, %xmm0			; AVX512-NEXT: vxorps %xmm0, %xmm0, %xmm0
	; AVX512-NEXT: vmovups %ymm0, (%rdi)			; AVX512-NEXT: vmovups %ymm0, (%rdi)
	; AVX512-NEXT: vzeroupper			; AVX512-NEXT: vzeroupper
	; AVX512-NEXT: retq			; AVX512-NEXT: retq
	store <16 x i16> zeroinitializer, <16 x i16>* %dst, align 1, !nontemporal !1			store <16 x i16> zeroinitializer, <16 x i16>* %dst, align 1, !nontemporal !1
	ret void			ret void
	}			}

	define void @test_zero_v32i8_align1(<32 x i8>* %dst) nounwind {			define void @test_zero_v32i8_align1(<32 x i8>* %dst) nounwind {
	; SSE-LABEL: test_zero_v32i8_align1:			; SSE-LABEL: test_zero_v32i8_align1:
	; SSE: # %bb.0:			; SSE: # %bb.0:
	; SSE-NEXT: xorps %xmm0, %xmm0			; SSE-NEXT: xorps %xmm0, %xmm0
	; SSE-NEXT: movups %xmm0, 16(%rdi)			; SSE-NEXT: movaps %xmm0, -{{[0-9]+}}(%rsp)
	; SSE-NEXT: movups %xmm0, (%rdi)			; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rax
				; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rcx
				; SSE-NEXT: movntiq %rcx, 24(%rdi)
				; SSE-NEXT: movntiq %rax, 16(%rdi)
				andreadb: This SSE sequence is clearly sub-optimal. That being said, I am not too worried about it, given how unlikely this scenario is in practice. If possible, it would be nice to have it fixed in a follow-up patch. Basically, there is no reason to zero XMM0, store it on the stack, and then reload its elements into GPRs. We should just zero a GPR and have both MOVNTI instructions use it. I suspect this has to do with how we lower certain nodes on SSE.
				; SSE-NEXT: movaps %xmm0, -{{[0-9]+}}(%rsp)
				; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rax
				; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rcx
				; SSE-NEXT: movntiq %rcx, 8(%rdi)
				; SSE-NEXT: movntiq %rax, (%rdi)
	; SSE-NEXT: retq			; SSE-NEXT: retq
	;			;
	; AVX-LABEL: test_zero_v32i8_align1:			; AVX-LABEL: test_zero_v32i8_align1:
	; AVX: # %bb.0:			; AVX: # %bb.0:
	; AVX-NEXT: vxorps %xmm0, %xmm0, %xmm0			; AVX-NEXT: vxorps %xmm0, %xmm0, %xmm0
	; AVX-NEXT: vmovups %ymm0, (%rdi)			; AVX-NEXT: vmovups %ymm0, (%rdi)
	; AVX-NEXT: vzeroupper			; AVX-NEXT: vzeroupper
	; AVX-NEXT: retq			; AVX-NEXT: retq
	(159 lines not shown)
	}			}
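
To make andreadb's inline comment on test_zero_v32i8_align1 concrete, here is a hand-written sketch of the SSE sequence it seems to be asking for: zero a single GPR and reuse it for every MOVNTI, with no XMM zeroing and no stack round-trip. This is an assumption about the desired result, not output from this patch or from any follow-up; the register choice and store order simply mirror the CHECK lines above.

```asm
# Hypothetical target sequence for the SSE run line (not generated by llc).
test_zero_v32i8_align1:
	xorl	%eax, %eax          # zero RAX; a 32-bit write clears the upper half too
	movntiq	%rax, 24(%rdi)      # non-temporal 8-byte store, bytes 24..31
	movntiq	%rax, 16(%rdi)      # bytes 16..23
	movntiq	%rax, 8(%rdi)       # bytes 8..15
	movntiq	%rax, (%rdi)        # bytes 0..7
	retq
```

Compared with the post-patch SSE checks, this drops the xorps, both movaps spills and all four movq reloads, while still covering the 32 bytes with four non-temporal stores.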

	; ZMM versions.			; ZMM versions.

	define void @test_zero_v8f64_align1(<8 x double>* %dst) nounwind {			define void @test_zero_v8f64_align1(<8 x double>* %dst) nounwind {
	; SSE-LABEL: test_zero_v8f64_align1:			; SSE-LABEL: test_zero_v8f64_align1:
	; SSE: # %bb.0:			; SSE: # %bb.0:
	; SSE-NEXT: xorps %xmm0, %xmm0			; SSE-NEXT: xorps %xmm0, %xmm0
	; SSE-NEXT: movups %xmm0, 48(%rdi)			; SSE-NEXT: movaps %xmm0, -{{[0-9]+}}(%rsp)
	; SSE-NEXT: movups %xmm0, 32(%rdi)			; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rax
	; SSE-NEXT: movups %xmm0, 16(%rdi)			; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rcx
	; SSE-NEXT: movups %xmm0, (%rdi)			; SSE-NEXT: movntiq %rcx, 56(%rdi)
				; SSE-NEXT: movntiq %rax, 48(%rdi)
				; SSE-NEXT: movaps %xmm0, -{{[0-9]+}}(%rsp)
				; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rax
				; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rcx
				; SSE-NEXT: movntiq %rcx, 40(%rdi)
				; SSE-NEXT: movntiq %rax, 32(%rdi)
				; SSE-NEXT: movaps %xmm0, -{{[0-9]+}}(%rsp)
				; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rax
				; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rcx
				; SSE-NEXT: movntiq %rcx, 24(%rdi)
				; SSE-NEXT: movntiq %rax, 16(%rdi)
				; SSE-NEXT: movaps %xmm0, -{{[0-9]+}}(%rsp)
				; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rax
				; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rcx
				; SSE-NEXT: movntiq %rcx, 8(%rdi)
				; SSE-NEXT: movntiq %rax, (%rdi)
	; SSE-NEXT: retq			; SSE-NEXT: retq
	;			;
	; AVX-LABEL: test_zero_v8f64_align1:			; AVX-LABEL: test_zero_v8f64_align1:
	; AVX: # %bb.0:			; AVX: # %bb.0:
	; AVX-NEXT: vxorps %xmm0, %xmm0, %xmm0			; AVX-NEXT: vxorps %xmm0, %xmm0, %xmm0
	; AVX-NEXT: vmovups %ymm0, 32(%rdi)			; AVX-NEXT: vmovups %ymm0, 32(%rdi)
	; AVX-NEXT: vmovups %ymm0, (%rdi)			; AVX-NEXT: vmovups %ymm0, (%rdi)
	; AVX-NEXT: vzeroupper			; AVX-NEXT: vzeroupper
	; AVX-NEXT: retq			; AVX-NEXT: retq
	;			;
	; AVX512-LABEL: test_zero_v8f64_align1:			; AVX512-LABEL: test_zero_v8f64_align1:
	; AVX512: # %bb.0:			; AVX512: # %bb.0:
	; AVX512-NEXT: vxorps %xmm0, %xmm0, %xmm0			; AVX512-NEXT: pushq %rbp
	; AVX512-NEXT: vmovups %zmm0, (%rdi)			; AVX512-NEXT: movq %rsp, %rbp
				; AVX512-NEXT: andq $-64, %rsp
				; AVX512-NEXT: subq $128, %rsp
				; AVX512-NEXT: vxorps %xmm0, %xmm0, %xmm0
				; AVX512-NEXT: vmovaps %zmm0, (%rsp)
				; AVX512-NEXT: movq {{[0-9]+}}(%rsp), %rax
				; AVX512-NEXT: movntiq %rax, 56(%rdi)
				; AVX512-NEXT: movq {{[0-9]+}}(%rsp), %rax
				; AVX512-NEXT: movntiq %rax, 48(%rdi)
				; AVX512-NEXT: movq {{[0-9]+}}(%rsp), %rax
				; AVX512-NEXT: movntiq %rax, 40(%rdi)
				; AVX512-NEXT: movq {{[0-9]+}}(%rsp), %rax
				; AVX512-NEXT: movntiq %rax, 32(%rdi)
				; AVX512-NEXT: movq {{[0-9]+}}(%rsp), %rax
				; AVX512-NEXT: movntiq %rax, 24(%rdi)
				; AVX512-NEXT: movq {{[0-9]+}}(%rsp), %rax
				; AVX512-NEXT: movntiq %rax, 16(%rdi)
				; AVX512-NEXT: movq (%rsp), %rax
				; AVX512-NEXT: movq {{[0-9]+}}(%rsp), %rcx
				; AVX512-NEXT: movntiq %rcx, 8(%rdi)
				; AVX512-NEXT: movntiq %rax, (%rdi)
				; AVX512-NEXT: movq %rbp, %rsp
				; AVX512-NEXT: popq %rbp
	; AVX512-NEXT: vzeroupper			; AVX512-NEXT: vzeroupper
	; AVX512-NEXT: retq			; AVX512-NEXT: retq
	store <8 x double> zeroinitializer, <8 x double>* %dst, align 1, !nontemporal !1			store <8 x double> zeroinitializer, <8 x double>* %dst, align 1, !nontemporal !1
	ret void			ret void
	}			}

	define void @test_zero_v16f32_align1(<16 x float>* %dst) nounwind {			define void @test_zero_v16f32_align1(<16 x float>* %dst) nounwind {
	; SSE-LABEL: test_zero_v16f32_align1:			; SSE-LABEL: test_zero_v16f32_align1:
	; SSE: # %bb.0:			; SSE: # %bb.0:
	; SSE-NEXT: xorps %xmm0, %xmm0			; SSE-NEXT: xorps %xmm0, %xmm0
	; SSE-NEXT: movups %xmm0, 48(%rdi)			; SSE-NEXT: movaps %xmm0, -{{[0-9]+}}(%rsp)
	; SSE-NEXT: movups %xmm0, 32(%rdi)			; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rax
	; SSE-NEXT: movups %xmm0, 16(%rdi)			; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rcx
	; SSE-NEXT: movups %xmm0, (%rdi)			; SSE-NEXT: movntiq %rcx, 56(%rdi)
				; SSE-NEXT: movntiq %rax, 48(%rdi)
				; SSE-NEXT: movaps %xmm0, -{{[0-9]+}}(%rsp)
				; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rax
				; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rcx
				; SSE-NEXT: movntiq %rcx, 40(%rdi)
				; SSE-NEXT: movntiq %rax, 32(%rdi)
				; SSE-NEXT: movaps %xmm0, -{{[0-9]+}}(%rsp)
				; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rax
				; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rcx
				; SSE-NEXT: movntiq %rcx, 24(%rdi)
				; SSE-NEXT: movntiq %rax, 16(%rdi)
				; SSE-NEXT: movaps %xmm0, -{{[0-9]+}}(%rsp)
				; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rax
				; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rcx
				; SSE-NEXT: movntiq %rcx, 8(%rdi)
				; SSE-NEXT: movntiq %rax, (%rdi)
	; SSE-NEXT: retq			; SSE-NEXT: retq
	;			;
	; AVX-LABEL: test_zero_v16f32_align1:			; AVX-LABEL: test_zero_v16f32_align1:
	; AVX: # %bb.0:			; AVX: # %bb.0:
	; AVX-NEXT: vxorps %xmm0, %xmm0, %xmm0			; AVX-NEXT: vxorps %xmm0, %xmm0, %xmm0
	; AVX-NEXT: vmovups %ymm0, 32(%rdi)			; AVX-NEXT: vmovups %ymm0, 32(%rdi)
	; AVX-NEXT: vmovups %ymm0, (%rdi)			; AVX-NEXT: vmovups %ymm0, (%rdi)
	; AVX-NEXT: vzeroupper			; AVX-NEXT: vzeroupper
	; AVX-NEXT: retq			; AVX-NEXT: retq
	;			;
	; AVX512-LABEL: test_zero_v16f32_align1:			; AVX512-LABEL: test_zero_v16f32_align1:
	; AVX512: # %bb.0:			; AVX512: # %bb.0:
	; AVX512-NEXT: vxorps %xmm0, %xmm0, %xmm0			; AVX512-NEXT: pushq %rbp
	; AVX512-NEXT: vmovups %zmm0, (%rdi)			; AVX512-NEXT: movq %rsp, %rbp
				; AVX512-NEXT: andq $-64, %rsp
				; AVX512-NEXT: subq $128, %rsp
				; AVX512-NEXT: vxorps %xmm0, %xmm0, %xmm0
				; AVX512-NEXT: vmovaps %zmm0, (%rsp)
				; AVX512-NEXT: movq {{[0-9]+}}(%rsp), %rax
				; AVX512-NEXT: movntiq %rax, 56(%rdi)
				; AVX512-NEXT: movq {{[0-9]+}}(%rsp), %rax
				; AVX512-NEXT: movntiq %rax, 48(%rdi)
				; AVX512-NEXT: movq {{[0-9]+}}(%rsp), %rax
				; AVX512-NEXT: movntiq %rax, 40(%rdi)
				; AVX512-NEXT: movq {{[0-9]+}}(%rsp), %rax
				; AVX512-NEXT: movntiq %rax, 32(%rdi)
				; AVX512-NEXT: movq {{[0-9]+}}(%rsp), %rax
				; AVX512-NEXT: movntiq %rax, 24(%rdi)
				; AVX512-NEXT: movq {{[0-9]+}}(%rsp), %rax
				; AVX512-NEXT: movntiq %rax, 16(%rdi)
				; AVX512-NEXT: movq (%rsp), %rax
				; AVX512-NEXT: movq {{[0-9]+}}(%rsp), %rcx
				; AVX512-NEXT: movntiq %rcx, 8(%rdi)
				; AVX512-NEXT: movntiq %rax, (%rdi)
				; AVX512-NEXT: movq %rbp, %rsp
				; AVX512-NEXT: popq %rbp
	; AVX512-NEXT: vzeroupper			; AVX512-NEXT: vzeroupper
	; AVX512-NEXT: retq			; AVX512-NEXT: retq
	store <16 x float> zeroinitializer, <16 x float>* %dst, align 1, !nontemporal !1			store <16 x float> zeroinitializer, <16 x float>* %dst, align 1, !nontemporal !1
	ret void			ret void
	}			}

	define void @test_zero_v8i64_align1(<8 x i64>* %dst) nounwind {			define void @test_zero_v8i64_align1(<8 x i64>* %dst) nounwind {
	; SSE-LABEL: test_zero_v8i64_align1:			; SSE-LABEL: test_zero_v8i64_align1:
	; SSE: # %bb.0:			; SSE: # %bb.0:
	; SSE-NEXT: xorps %xmm0, %xmm0			; SSE-NEXT: xorps %xmm0, %xmm0
	; SSE-NEXT: movups %xmm0, 48(%rdi)			; SSE-NEXT: movaps %xmm0, -{{[0-9]+}}(%rsp)
	; SSE-NEXT: movups %xmm0, 32(%rdi)			; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rax
	; SSE-NEXT: movups %xmm0, 16(%rdi)			; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rcx
	; SSE-NEXT: movups %xmm0, (%rdi)			; SSE-NEXT: movntiq %rcx, 56(%rdi)
				; SSE-NEXT: movntiq %rax, 48(%rdi)
				; SSE-NEXT: movaps %xmm0, -{{[0-9]+}}(%rsp)
				; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rax
				; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rcx
				; SSE-NEXT: movntiq %rcx, 40(%rdi)
				; SSE-NEXT: movntiq %rax, 32(%rdi)
				; SSE-NEXT: movaps %xmm0, -{{[0-9]+}}(%rsp)
				; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rax
				; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rcx
				; SSE-NEXT: movntiq %rcx, 24(%rdi)
				; SSE-NEXT: movntiq %rax, 16(%rdi)
				; SSE-NEXT: movaps %xmm0, -{{[0-9]+}}(%rsp)
				; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rax
				; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rcx
				; SSE-NEXT: movntiq %rcx, 8(%rdi)
				; SSE-NEXT: movntiq %rax, (%rdi)
	; SSE-NEXT: retq			; SSE-NEXT: retq
	;			;
	; AVX-LABEL: test_zero_v8i64_align1:			; AVX-LABEL: test_zero_v8i64_align1:
	; AVX: # %bb.0:			; AVX: # %bb.0:
	; AVX-NEXT: vxorps %xmm0, %xmm0, %xmm0			; AVX-NEXT: vxorps %xmm0, %xmm0, %xmm0
	; AVX-NEXT: vmovups %ymm0, 32(%rdi)			; AVX-NEXT: vmovups %ymm0, 32(%rdi)
	; AVX-NEXT: vmovups %ymm0, (%rdi)			; AVX-NEXT: vmovups %ymm0, (%rdi)
	; AVX-NEXT: vzeroupper			; AVX-NEXT: vzeroupper
	; AVX-NEXT: retq			; AVX-NEXT: retq
	;			;
	; AVX512-LABEL: test_zero_v8i64_align1:			; AVX512-LABEL: test_zero_v8i64_align1:
	; AVX512: # %bb.0:			; AVX512: # %bb.0:
	; AVX512-NEXT: vxorps %xmm0, %xmm0, %xmm0			; AVX512-NEXT: pushq %rbp
	; AVX512-NEXT: vmovups %zmm0, (%rdi)			; AVX512-NEXT: movq %rsp, %rbp
				; AVX512-NEXT: andq $-64, %rsp
				; AVX512-NEXT: subq $128, %rsp
				; AVX512-NEXT: vxorps %xmm0, %xmm0, %xmm0
				; AVX512-NEXT: vmovaps %zmm0, (%rsp)
				; AVX512-NEXT: movq {{[0-9]+}}(%rsp), %rax
				; AVX512-NEXT: movntiq %rax, 56(%rdi)
				; AVX512-NEXT: movq {{[0-9]+}}(%rsp), %rax
				; AVX512-NEXT: movntiq %rax, 48(%rdi)
				; AVX512-NEXT: movq {{[0-9]+}}(%rsp), %rax
				; AVX512-NEXT: movntiq %rax, 40(%rdi)
				; AVX512-NEXT: movq {{[0-9]+}}(%rsp), %rax
				; AVX512-NEXT: movntiq %rax, 32(%rdi)
				; AVX512-NEXT: movq {{[0-9]+}}(%rsp), %rax
				; AVX512-NEXT: movntiq %rax, 24(%rdi)
				; AVX512-NEXT: movq {{[0-9]+}}(%rsp), %rax
				; AVX512-NEXT: movntiq %rax, 16(%rdi)
				; AVX512-NEXT: movq (%rsp), %rax
				; AVX512-NEXT: movq {{[0-9]+}}(%rsp), %rcx
				; AVX512-NEXT: movntiq %rcx, 8(%rdi)
				; AVX512-NEXT: movntiq %rax, (%rdi)
				; AVX512-NEXT: movq %rbp, %rsp
				; AVX512-NEXT: popq %rbp
	; AVX512-NEXT: vzeroupper			; AVX512-NEXT: vzeroupper
	; AVX512-NEXT: retq			; AVX512-NEXT: retq
	store <8 x i64> zeroinitializer, <8 x i64>* %dst, align 1, !nontemporal !1			store <8 x i64> zeroinitializer, <8 x i64>* %dst, align 1, !nontemporal !1
	ret void			ret void
	}			}

	define void @test_zero_v16i32_align1(<16 x i32>* %dst) nounwind {			define void @test_zero_v16i32_align1(<16 x i32>* %dst) nounwind {
	; SSE-LABEL: test_zero_v16i32_align1:			; SSE-LABEL: test_zero_v16i32_align1:
	; SSE: # %bb.0:			; SSE: # %bb.0:
	; SSE-NEXT: xorps %xmm0, %xmm0			; SSE-NEXT: xorps %xmm0, %xmm0
	; SSE-NEXT: movups %xmm0, 48(%rdi)			; SSE-NEXT: movaps %xmm0, -{{[0-9]+}}(%rsp)
	; SSE-NEXT: movups %xmm0, 32(%rdi)			; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rax
	; SSE-NEXT: movups %xmm0, 16(%rdi)			; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rcx
	; SSE-NEXT: movups %xmm0, (%rdi)			; SSE-NEXT: movntiq %rcx, 56(%rdi)
				; SSE-NEXT: movntiq %rax, 48(%rdi)
				; SSE-NEXT: movaps %xmm0, -{{[0-9]+}}(%rsp)
				; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rax
				; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rcx
				; SSE-NEXT: movntiq %rcx, 40(%rdi)
				; SSE-NEXT: movntiq %rax, 32(%rdi)
				; SSE-NEXT: movaps %xmm0, -{{[0-9]+}}(%rsp)
				; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rax
				; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rcx
				; SSE-NEXT: movntiq %rcx, 24(%rdi)
				; SSE-NEXT: movntiq %rax, 16(%rdi)
				; SSE-NEXT: movaps %xmm0, -{{[0-9]+}}(%rsp)
				; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rax
				; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rcx
				; SSE-NEXT: movntiq %rcx, 8(%rdi)
				; SSE-NEXT: movntiq %rax, (%rdi)
	; SSE-NEXT: retq			; SSE-NEXT: retq
	;			;
	; AVX-LABEL: test_zero_v16i32_align1:			; AVX-LABEL: test_zero_v16i32_align1:
	; AVX: # %bb.0:			; AVX: # %bb.0:
	; AVX-NEXT: vxorps %xmm0, %xmm0, %xmm0			; AVX-NEXT: vxorps %xmm0, %xmm0, %xmm0
	; AVX-NEXT: vmovups %ymm0, 32(%rdi)			; AVX-NEXT: vmovups %ymm0, 32(%rdi)
	; AVX-NEXT: vmovups %ymm0, (%rdi)			; AVX-NEXT: vmovups %ymm0, (%rdi)
	; AVX-NEXT: vzeroupper			; AVX-NEXT: vzeroupper
	; AVX-NEXT: retq			; AVX-NEXT: retq
	;			;
	; AVX512-LABEL: test_zero_v16i32_align1:			; AVX512-LABEL: test_zero_v16i32_align1:
	; AVX512: # %bb.0:			; AVX512: # %bb.0:
	; AVX512-NEXT: vxorps %xmm0, %xmm0, %xmm0			; AVX512-NEXT: pushq %rbp
	; AVX512-NEXT: vmovups %zmm0, (%rdi)			; AVX512-NEXT: movq %rsp, %rbp
				; AVX512-NEXT: andq $-64, %rsp
				; AVX512-NEXT: subq $128, %rsp
				; AVX512-NEXT: vxorps %xmm0, %xmm0, %xmm0
				; AVX512-NEXT: vmovaps %zmm0, (%rsp)
				; AVX512-NEXT: movq {{[0-9]+}}(%rsp), %rax
				; AVX512-NEXT: movntiq %rax, 56(%rdi)
				; AVX512-NEXT: movq {{[0-9]+}}(%rsp), %rax
				; AVX512-NEXT: movntiq %rax, 48(%rdi)
				; AVX512-NEXT: movq {{[0-9]+}}(%rsp), %rax
				; AVX512-NEXT: movntiq %rax, 40(%rdi)
				; AVX512-NEXT: movq {{[0-9]+}}(%rsp), %rax
				; AVX512-NEXT: movntiq %rax, 32(%rdi)
				; AVX512-NEXT: movq {{[0-9]+}}(%rsp), %rax
				; AVX512-NEXT: movntiq %rax, 24(%rdi)
				; AVX512-NEXT: movq {{[0-9]+}}(%rsp), %rax
				; AVX512-NEXT: movntiq %rax, 16(%rdi)
				; AVX512-NEXT: movq (%rsp), %rax
				; AVX512-NEXT: movq {{[0-9]+}}(%rsp), %rcx
				; AVX512-NEXT: movntiq %rcx, 8(%rdi)
				; AVX512-NEXT: movntiq %rax, (%rdi)
				; AVX512-NEXT: movq %rbp, %rsp
				; AVX512-NEXT: popq %rbp
	; AVX512-NEXT: vzeroupper			; AVX512-NEXT: vzeroupper
	; AVX512-NEXT: retq			; AVX512-NEXT: retq
	store <16 x i32> zeroinitializer, <16 x i32>* %dst, align 1, !nontemporal !1			store <16 x i32> zeroinitializer, <16 x i32>* %dst, align 1, !nontemporal !1
	ret void			ret void
	}			}

	define void @test_zero_v32i16_align1(<32 x i16>* %dst) nounwind {			define void @test_zero_v32i16_align1(<32 x i16>* %dst) nounwind {
	; SSE-LABEL: test_zero_v32i16_align1:			; SSE-LABEL: test_zero_v32i16_align1:
	; SSE: # %bb.0:			; SSE: # %bb.0:
	; SSE-NEXT: xorps %xmm0, %xmm0			; SSE-NEXT: xorps %xmm0, %xmm0
	; SSE-NEXT: movups %xmm0, 48(%rdi)			; SSE-NEXT: movaps %xmm0, -{{[0-9]+}}(%rsp)
	; SSE-NEXT: movups %xmm0, 32(%rdi)			; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rax
	; SSE-NEXT: movups %xmm0, 16(%rdi)			; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rcx
	; SSE-NEXT: movups %xmm0, (%rdi)			; SSE-NEXT: movntiq %rcx, 56(%rdi)
				; SSE-NEXT: movntiq %rax, 48(%rdi)
				; SSE-NEXT: movaps %xmm0, -{{[0-9]+}}(%rsp)
				; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rax
				; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rcx
				; SSE-NEXT: movntiq %rcx, 40(%rdi)
				; SSE-NEXT: movntiq %rax, 32(%rdi)
				; SSE-NEXT: movaps %xmm0, -{{[0-9]+}}(%rsp)
				; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rax
				; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rcx
				; SSE-NEXT: movntiq %rcx, 24(%rdi)
				; SSE-NEXT: movntiq %rax, 16(%rdi)
				; SSE-NEXT: movaps %xmm0, -{{[0-9]+}}(%rsp)
				; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rax
				; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rcx
				; SSE-NEXT: movntiq %rcx, 8(%rdi)
				; SSE-NEXT: movntiq %rax, (%rdi)
	; SSE-NEXT: retq			; SSE-NEXT: retq
	;			;
	; AVX-LABEL: test_zero_v32i16_align1:			; AVX-LABEL: test_zero_v32i16_align1:
	; AVX: # %bb.0:			; AVX: # %bb.0:
	; AVX-NEXT: vxorps %xmm0, %xmm0, %xmm0			; AVX-NEXT: vxorps %xmm0, %xmm0, %xmm0
	; AVX-NEXT: vmovups %ymm0, 32(%rdi)			; AVX-NEXT: vmovups %ymm0, 32(%rdi)
	; AVX-NEXT: vmovups %ymm0, (%rdi)			; AVX-NEXT: vmovups %ymm0, (%rdi)
	; AVX-NEXT: vzeroupper			; AVX-NEXT: vzeroupper
	; AVX-NEXT: retq			; AVX-NEXT: retq
	;			;
	; AVX512DQ-LABEL: test_zero_v32i16_align1:			; AVX512DQ-LABEL: test_zero_v32i16_align1:
	; AVX512DQ: # %bb.0:			; AVX512DQ: # %bb.0:
	; AVX512DQ-NEXT: vxorps %xmm0, %xmm0, %xmm0			; AVX512DQ-NEXT: vxorps %xmm0, %xmm0, %xmm0
	; AVX512DQ-NEXT: vmovups %ymm0, 32(%rdi)			; AVX512DQ-NEXT: vmovups %ymm0, 32(%rdi)
	; AVX512DQ-NEXT: vmovups %ymm0, (%rdi)			; AVX512DQ-NEXT: vmovups %ymm0, (%rdi)
	; AVX512DQ-NEXT: vzeroupper			; AVX512DQ-NEXT: vzeroupper
	; AVX512DQ-NEXT: retq			; AVX512DQ-NEXT: retq
	;			;
	; AVX512BW-LABEL: test_zero_v32i16_align1:			; AVX512BW-LABEL: test_zero_v32i16_align1:
	; AVX512BW: # %bb.0:			; AVX512BW: # %bb.0:
				; AVX512BW-NEXT: pushq %rbp
				; AVX512BW-NEXT: movq %rsp, %rbp
				; AVX512BW-NEXT: andq $-64, %rsp
				; AVX512BW-NEXT: subq $128, %rsp
	; AVX512BW-NEXT: vxorps %xmm0, %xmm0, %xmm0			; AVX512BW-NEXT: vxorps %xmm0, %xmm0, %xmm0
	; AVX512BW-NEXT: vmovups %zmm0, (%rdi)			; AVX512BW-NEXT: vmovaps %zmm0, (%rsp)
				; AVX512BW-NEXT: movq {{[0-9]+}}(%rsp), %rax
				; AVX512BW-NEXT: movntiq %rax, 56(%rdi)
				; AVX512BW-NEXT: movq {{[0-9]+}}(%rsp), %rax
				; AVX512BW-NEXT: movntiq %rax, 48(%rdi)
				; AVX512BW-NEXT: movq {{[0-9]+}}(%rsp), %rax
				; AVX512BW-NEXT: movntiq %rax, 40(%rdi)
				; AVX512BW-NEXT: movq {{[0-9]+}}(%rsp), %rax
				; AVX512BW-NEXT: movntiq %rax, 32(%rdi)
				; AVX512BW-NEXT: movq {{[0-9]+}}(%rsp), %rax
				; AVX512BW-NEXT: movntiq %rax, 24(%rdi)
				; AVX512BW-NEXT: movq {{[0-9]+}}(%rsp), %rax
				; AVX512BW-NEXT: movntiq %rax, 16(%rdi)
				; AVX512BW-NEXT: movq (%rsp), %rax
				; AVX512BW-NEXT: movq {{[0-9]+}}(%rsp), %rcx
				; AVX512BW-NEXT: movntiq %rcx, 8(%rdi)
				; AVX512BW-NEXT: movntiq %rax, (%rdi)
				; AVX512BW-NEXT: movq %rbp, %rsp
				; AVX512BW-NEXT: popq %rbp
	; AVX512BW-NEXT: vzeroupper			; AVX512BW-NEXT: vzeroupper
	; AVX512BW-NEXT: retq			; AVX512BW-NEXT: retq
	store <32 x i16> zeroinitializer, <32 x i16>* %dst, align 1, !nontemporal !1			store <32 x i16> zeroinitializer, <32 x i16>* %dst, align 1, !nontemporal !1
	ret void			ret void
	}			}

	define void @test_zero_v64i8_align1(<64 x i8>* %dst) nounwind {			define void @test_zero_v64i8_align1(<64 x i8>* %dst) nounwind {
	; SSE-LABEL: test_zero_v64i8_align1:			; SSE-LABEL: test_zero_v64i8_align1:
	; SSE: # %bb.0:			; SSE: # %bb.0:
	; SSE-NEXT: xorps %xmm0, %xmm0			; SSE-NEXT: xorps %xmm0, %xmm0
	; SSE-NEXT: movups %xmm0, 48(%rdi)			; SSE-NEXT: movaps %xmm0, -{{[0-9]+}}(%rsp)
	; SSE-NEXT: movups %xmm0, 32(%rdi)			; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rax
	; SSE-NEXT: movups %xmm0, 16(%rdi)			; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rcx
	; SSE-NEXT: movups %xmm0, (%rdi)			; SSE-NEXT: movntiq %rcx, 56(%rdi)
				; SSE-NEXT: movntiq %rax, 48(%rdi)
				; SSE-NEXT: movaps %xmm0, -{{[0-9]+}}(%rsp)
				; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rax
				; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rcx
				; SSE-NEXT: movntiq %rcx, 40(%rdi)
				; SSE-NEXT: movntiq %rax, 32(%rdi)
				; SSE-NEXT: movaps %xmm0, -{{[0-9]+}}(%rsp)
				; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rax
				; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rcx
				; SSE-NEXT: movntiq %rcx, 24(%rdi)
				; SSE-NEXT: movntiq %rax, 16(%rdi)
				; SSE-NEXT: movaps %xmm0, -{{[0-9]+}}(%rsp)
				; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rax
				; SSE-NEXT: movq -{{[0-9]+}}(%rsp), %rcx
				; SSE-NEXT: movntiq %rcx, 8(%rdi)
				; SSE-NEXT: movntiq %rax, (%rdi)
	; SSE-NEXT: retq			; SSE-NEXT: retq
	;			;
	; AVX-LABEL: test_zero_v64i8_align1:			; AVX-LABEL: test_zero_v64i8_align1:
	; AVX: # %bb.0:			; AVX: # %bb.0:
	; AVX-NEXT: vxorps %xmm0, %xmm0, %xmm0			; AVX-NEXT: vxorps %xmm0, %xmm0, %xmm0
	; AVX-NEXT: vmovups %ymm0, 32(%rdi)			; AVX-NEXT: vmovups %ymm0, 32(%rdi)
	; AVX-NEXT: vmovups %ymm0, (%rdi)			; AVX-NEXT: vmovups %ymm0, (%rdi)
	; AVX-NEXT: vzeroupper			; AVX-NEXT: vzeroupper
	; AVX-NEXT: retq			; AVX-NEXT: retq
	;			;
	; AVX512DQ-LABEL: test_zero_v64i8_align1:			; AVX512DQ-LABEL: test_zero_v64i8_align1:
	; AVX512DQ: # %bb.0:			; AVX512DQ: # %bb.0:
	; AVX512DQ-NEXT: vxorps %xmm0, %xmm0, %xmm0			; AVX512DQ-NEXT: vxorps %xmm0, %xmm0, %xmm0
	; AVX512DQ-NEXT: vmovups %ymm0, 32(%rdi)			; AVX512DQ-NEXT: vmovups %ymm0, 32(%rdi)
	; AVX512DQ-NEXT: vmovups %ymm0, (%rdi)			; AVX512DQ-NEXT: vmovups %ymm0, (%rdi)
	; AVX512DQ-NEXT: vzeroupper			; AVX512DQ-NEXT: vzeroupper
	; AVX512DQ-NEXT: retq			; AVX512DQ-NEXT: retq
	;			;
	; AVX512BW-LABEL: test_zero_v64i8_align1:			; AVX512BW-LABEL: test_zero_v64i8_align1:
	; AVX512BW: # %bb.0:			; AVX512BW: # %bb.0:
				; AVX512BW-NEXT: pushq %rbp
				; AVX512BW-NEXT: movq %rsp, %rbp
				; AVX512BW-NEXT: andq $-64, %rsp
				; AVX512BW-NEXT: subq $128, %rsp
	; AVX512BW-NEXT: vxorps %xmm0, %xmm0, %xmm0			; AVX512BW-NEXT: vxorps %xmm0, %xmm0, %xmm0
	; AVX512BW-NEXT: vmovups %zmm0, (%rdi)			; AVX512BW-NEXT: vmovaps %zmm0, (%rsp)
				; AVX512BW-NEXT: movq {{[0-9]+}}(%rsp), %rax
				; AVX512BW-NEXT: movntiq %rax, 56(%rdi)
				; AVX512BW-NEXT: movq {{[0-9]+}}(%rsp), %rax
				; AVX512BW-NEXT: movntiq %rax, 48(%rdi)
				; AVX512BW-NEXT: movq {{[0-9]+}}(%rsp), %rax
				; AVX512BW-NEXT: movntiq %rax, 40(%rdi)
				; AVX512BW-NEXT: movq {{[0-9]+}}(%rsp), %rax
				; AVX512BW-NEXT: movntiq %rax, 32(%rdi)
				; AVX512BW-NEXT: movq {{[0-9]+}}(%rsp), %rax
				; AVX512BW-NEXT: movntiq %rax, 24(%rdi)
				; AVX512BW-NEXT: movq {{[0-9]+}}(%rsp), %rax
				; AVX512BW-NEXT: movntiq %rax, 16(%rdi)
				; AVX512BW-NEXT: movq (%rsp), %rax
				; AVX512BW-NEXT: movq {{[0-9]+}}(%rsp), %rcx
				; AVX512BW-NEXT: movntiq %rcx, 8(%rdi)
				; AVX512BW-NEXT: movntiq %rax, (%rdi)
				; AVX512BW-NEXT: movq %rbp, %rsp
				; AVX512BW-NEXT: popq %rbp
	; AVX512BW-NEXT: vzeroupper			; AVX512BW-NEXT: vzeroupper
	; AVX512BW-NEXT: retq			; AVX512BW-NEXT: retq
	store <64 x i8> zeroinitializer, <64 x i8>* %dst, align 1, !nontemporal !1			store <64 x i8> zeroinitializer, <64 x i8>* %dst, align 1, !nontemporal !1
	ret void			ret void
	}			}

	define void @test_zero_v8f64_align16(<8 x double>* %dst) nounwind {			define void @test_zero_v8f64_align16(<8 x double>* %dst) nounwind {
	; SSE-LABEL: test_zero_v8f64_align16:			; SSE-LABEL: test_zero_v8f64_align16:
	[10 unchanged lines hidden]
	; AVX-NEXT: vxorps %xmm0, %xmm0, %xmm0			; AVX-NEXT: vxorps %xmm0, %xmm0, %xmm0
	; AVX-NEXT: vmovups %ymm0, 32(%rdi)			; AVX-NEXT: vmovups %ymm0, 32(%rdi)
	; AVX-NEXT: vmovups %ymm0, (%rdi)			; AVX-NEXT: vmovups %ymm0, (%rdi)
	; AVX-NEXT: vzeroupper			; AVX-NEXT: vzeroupper
	; AVX-NEXT: retq			; AVX-NEXT: retq
	;			;
	; AVX512-LABEL: test_zero_v8f64_align16:			; AVX512-LABEL: test_zero_v8f64_align16:
	; AVX512: # %bb.0:			; AVX512: # %bb.0:
	; AVX512-NEXT: vxorps %xmm0, %xmm0, %xmm0			; AVX512-NEXT: pushq %rbp
	; AVX512-NEXT: vmovups %zmm0, (%rdi)			; AVX512-NEXT: movq %rsp, %rbp
				; AVX512-NEXT: andq $-64, %rsp
				; AVX512-NEXT: subq $128, %rsp
				; AVX512-NEXT: vxorps %xmm0, %xmm0, %xmm0
				; AVX512-NEXT: vmovaps %zmm0, (%rsp)
				; AVX512-NEXT: vmovaps (%rsp), %xmm0
				; AVX512-NEXT: vmovaps {{[0-9]+}}(%rsp), %xmm1
				; AVX512-NEXT: vmovaps {{[0-9]+}}(%rsp), %xmm2
				; AVX512-NEXT: vmovaps {{[0-9]+}}(%rsp), %xmm3
				; AVX512-NEXT: vmovntps %xmm3, 48(%rdi)
				; AVX512-NEXT: vmovntps %xmm2, 32(%rdi)
				; AVX512-NEXT: vmovntps %xmm1, 16(%rdi)
				; AVX512-NEXT: vmovntps %xmm0, (%rdi)
				; AVX512-NEXT: movq %rbp, %rsp
				; AVX512-NEXT: popq %rbp
	; AVX512-NEXT: vzeroupper			; AVX512-NEXT: vzeroupper
	; AVX512-NEXT: retq			; AVX512-NEXT: retq
	store <8 x double> zeroinitializer, <8 x double>* %dst, align 16, !nontemporal !1			store <8 x double> zeroinitializer, <8 x double>* %dst, align 16, !nontemporal !1
	ret void			ret void
	}			}

	define void @test_zero_v16f32_align16(<16 x float>* %dst) nounwind {			define void @test_zero_v16f32_align16(<16 x float>* %dst) nounwind {
	; SSE-LABEL: test_zero_v16f32_align16:			; SSE-LABEL: test_zero_v16f32_align16:
	[10 unchanged lines hidden]
	; AVX-NEXT: vxorps %xmm0, %xmm0, %xmm0			; AVX-NEXT: vxorps %xmm0, %xmm0, %xmm0
	; AVX-NEXT: vmovups %ymm0, 32(%rdi)			; AVX-NEXT: vmovups %ymm0, 32(%rdi)
	; AVX-NEXT: vmovups %ymm0, (%rdi)			; AVX-NEXT: vmovups %ymm0, (%rdi)
	; AVX-NEXT: vzeroupper			; AVX-NEXT: vzeroupper
	; AVX-NEXT: retq			; AVX-NEXT: retq
	;			;
	; AVX512-LABEL: test_zero_v16f32_align16:			; AVX512-LABEL: test_zero_v16f32_align16:
	; AVX512: # %bb.0:			; AVX512: # %bb.0:
	; AVX512-NEXT: vxorps %xmm0, %xmm0, %xmm0			; AVX512-NEXT: pushq %rbp
	; AVX512-NEXT: vmovups %zmm0, (%rdi)			; AVX512-NEXT: movq %rsp, %rbp
				; AVX512-NEXT: andq $-64, %rsp
				; AVX512-NEXT: subq $128, %rsp
				; AVX512-NEXT: vxorps %xmm0, %xmm0, %xmm0
				; AVX512-NEXT: vmovaps %zmm0, (%rsp)
				; AVX512-NEXT: vmovaps (%rsp), %xmm0
				; AVX512-NEXT: vmovaps {{[0-9]+}}(%rsp), %xmm1
				; AVX512-NEXT: vmovaps {{[0-9]+}}(%rsp), %xmm2
				; AVX512-NEXT: vmovaps {{[0-9]+}}(%rsp), %xmm3
				; AVX512-NEXT: vmovntps %xmm3, 48(%rdi)
				; AVX512-NEXT: vmovntps %xmm2, 32(%rdi)
				; AVX512-NEXT: vmovntps %xmm1, 16(%rdi)
				; AVX512-NEXT: vmovntps %xmm0, (%rdi)
				; AVX512-NEXT: movq %rbp, %rsp
				; AVX512-NEXT: popq %rbp
	; AVX512-NEXT: vzeroupper			; AVX512-NEXT: vzeroupper
	; AVX512-NEXT: retq			; AVX512-NEXT: retq
	store <16 x float> zeroinitializer, <16 x float>* %dst, align 16, !nontemporal !1			store <16 x float> zeroinitializer, <16 x float>* %dst, align 16, !nontemporal !1
	ret void			ret void
	}			}

	define void @test_zero_v8i64_align16(<8 x i64>* %dst) nounwind {			define void @test_zero_v8i64_align16(<8 x i64>* %dst) nounwind {
	; SSE-LABEL: test_zero_v8i64_align16:			; SSE-LABEL: test_zero_v8i64_align16:
	[10 unchanged lines hidden]
	; AVX-NEXT: vxorps %xmm0, %xmm0, %xmm0			; AVX-NEXT: vxorps %xmm0, %xmm0, %xmm0
	; AVX-NEXT: vmovups %ymm0, 32(%rdi)			; AVX-NEXT: vmovups %ymm0, 32(%rdi)
	; AVX-NEXT: vmovups %ymm0, (%rdi)			; AVX-NEXT: vmovups %ymm0, (%rdi)
	; AVX-NEXT: vzeroupper			; AVX-NEXT: vzeroupper
	; AVX-NEXT: retq			; AVX-NEXT: retq
	;			;
	; AVX512-LABEL: test_zero_v8i64_align16:			; AVX512-LABEL: test_zero_v8i64_align16:
	; AVX512: # %bb.0:			; AVX512: # %bb.0:
	; AVX512-NEXT: vxorps %xmm0, %xmm0, %xmm0			; AVX512-NEXT: pushq %rbp
	; AVX512-NEXT: vmovups %zmm0, (%rdi)			; AVX512-NEXT: movq %rsp, %rbp
				; AVX512-NEXT: andq $-64, %rsp
				; AVX512-NEXT: subq $128, %rsp
				; AVX512-NEXT: vxorps %xmm0, %xmm0, %xmm0
				; AVX512-NEXT: vmovaps %zmm0, (%rsp)
				; AVX512-NEXT: vmovaps (%rsp), %xmm0
				; AVX512-NEXT: vmovaps {{[0-9]+}}(%rsp), %xmm1
				; AVX512-NEXT: vmovaps {{[0-9]+}}(%rsp), %xmm2
				; AVX512-NEXT: vmovaps {{[0-9]+}}(%rsp), %xmm3
				; AVX512-NEXT: vmovntps %xmm3, 48(%rdi)
				; AVX512-NEXT: vmovntps %xmm2, 32(%rdi)
				; AVX512-NEXT: vmovntps %xmm1, 16(%rdi)
				; AVX512-NEXT: vmovntps %xmm0, (%rdi)
				; AVX512-NEXT: movq %rbp, %rsp
				; AVX512-NEXT: popq %rbp
	; AVX512-NEXT: vzeroupper			; AVX512-NEXT: vzeroupper
	; AVX512-NEXT: retq			; AVX512-NEXT: retq
	store <8 x i64> zeroinitializer, <8 x i64>* %dst, align 16, !nontemporal !1			store <8 x i64> zeroinitializer, <8 x i64>* %dst, align 16, !nontemporal !1
	ret void			ret void
	}			}

	define void @test_zero_v16i32_align16(<16 x i32>* %dst) nounwind {			define void @test_zero_v16i32_align16(<16 x i32>* %dst) nounwind {
	; SSE-LABEL: test_zero_v16i32_align16:			; SSE-LABEL: test_zero_v16i32_align16:
	[10 unchanged lines hidden]
	; AVX-NEXT: vxorps %xmm0, %xmm0, %xmm0			; AVX-NEXT: vxorps %xmm0, %xmm0, %xmm0
	; AVX-NEXT: vmovups %ymm0, 32(%rdi)			; AVX-NEXT: vmovups %ymm0, 32(%rdi)
	; AVX-NEXT: vmovups %ymm0, (%rdi)			; AVX-NEXT: vmovups %ymm0, (%rdi)
	; AVX-NEXT: vzeroupper			; AVX-NEXT: vzeroupper
	; AVX-NEXT: retq			; AVX-NEXT: retq
	;			;
	; AVX512-LABEL: test_zero_v16i32_align16:			; AVX512-LABEL: test_zero_v16i32_align16:
	; AVX512: # %bb.0:			; AVX512: # %bb.0:
	; AVX512-NEXT: vxorps %xmm0, %xmm0, %xmm0			; AVX512-NEXT: pushq %rbp
	; AVX512-NEXT: vmovups %zmm0, (%rdi)			; AVX512-NEXT: movq %rsp, %rbp
				; AVX512-NEXT: andq $-64, %rsp
				; AVX512-NEXT: subq $128, %rsp
				; AVX512-NEXT: vxorps %xmm0, %xmm0, %xmm0
				; AVX512-NEXT: vmovaps %zmm0, (%rsp)
				; AVX512-NEXT: vmovaps (%rsp), %xmm0
				; AVX512-NEXT: vmovaps {{[0-9]+}}(%rsp), %xmm1
				; AVX512-NEXT: vmovaps {{[0-9]+}}(%rsp), %xmm2
				; AVX512-NEXT: vmovaps {{[0-9]+}}(%rsp), %xmm3
				; AVX512-NEXT: vmovntps %xmm3, 48(%rdi)
				; AVX512-NEXT: vmovntps %xmm2, 32(%rdi)
				; AVX512-NEXT: vmovntps %xmm1, 16(%rdi)
				; AVX512-NEXT: vmovntps %xmm0, (%rdi)
				; AVX512-NEXT: movq %rbp, %rsp
				; AVX512-NEXT: popq %rbp
	; AVX512-NEXT: vzeroupper			; AVX512-NEXT: vzeroupper
	; AVX512-NEXT: retq			; AVX512-NEXT: retq
	store <16 x i32> zeroinitializer, <16 x i32>* %dst, align 16, !nontemporal !1			store <16 x i32> zeroinitializer, <16 x i32>* %dst, align 16, !nontemporal !1
	ret void			ret void
	}			}

	define void @test_zero_v32i16_align16(<32 x i16>* %dst) nounwind {			define void @test_zero_v32i16_align16(<32 x i16>* %dst) nounwind {
	; SSE-LABEL: test_zero_v32i16_align16:			; SSE-LABEL: test_zero_v32i16_align16:
	[18 unchanged lines hidden]
	; AVX512DQ-NEXT: vxorps %xmm0, %xmm0, %xmm0			; AVX512DQ-NEXT: vxorps %xmm0, %xmm0, %xmm0
	; AVX512DQ-NEXT: vmovups %ymm0, 32(%rdi)			; AVX512DQ-NEXT: vmovups %ymm0, 32(%rdi)
	; AVX512DQ-NEXT: vmovups %ymm0, (%rdi)			; AVX512DQ-NEXT: vmovups %ymm0, (%rdi)
	; AVX512DQ-NEXT: vzeroupper			; AVX512DQ-NEXT: vzeroupper
	; AVX512DQ-NEXT: retq			; AVX512DQ-NEXT: retq
	;			;
	; AVX512BW-LABEL: test_zero_v32i16_align16:			; AVX512BW-LABEL: test_zero_v32i16_align16:
	; AVX512BW: # %bb.0:			; AVX512BW: # %bb.0:
				; AVX512BW-NEXT: pushq %rbp
				; AVX512BW-NEXT: movq %rsp, %rbp
				; AVX512BW-NEXT: andq $-64, %rsp
				; AVX512BW-NEXT: subq $128, %rsp
	; AVX512BW-NEXT: vxorps %xmm0, %xmm0, %xmm0			; AVX512BW-NEXT: vxorps %xmm0, %xmm0, %xmm0
	; AVX512BW-NEXT: vmovups %zmm0, (%rdi)			; AVX512BW-NEXT: vmovaps %zmm0, (%rsp)
				; AVX512BW-NEXT: vmovaps (%rsp), %xmm0
				; AVX512BW-NEXT: vmovaps {{[0-9]+}}(%rsp), %xmm1
				; AVX512BW-NEXT: vmovaps {{[0-9]+}}(%rsp), %xmm2
				; AVX512BW-NEXT: vmovaps {{[0-9]+}}(%rsp), %xmm3
				; AVX512BW-NEXT: vmovntps %xmm3, 48(%rdi)
				; AVX512BW-NEXT: vmovntps %xmm2, 32(%rdi)
				; AVX512BW-NEXT: vmovntps %xmm1, 16(%rdi)
				; AVX512BW-NEXT: vmovntps %xmm0, (%rdi)
				; AVX512BW-NEXT: movq %rbp, %rsp
				; AVX512BW-NEXT: popq %rbp
	; AVX512BW-NEXT: vzeroupper			; AVX512BW-NEXT: vzeroupper
	; AVX512BW-NEXT: retq			; AVX512BW-NEXT: retq
	store <32 x i16> zeroinitializer, <32 x i16>* %dst, align 16, !nontemporal !1			store <32 x i16> zeroinitializer, <32 x i16>* %dst, align 16, !nontemporal !1
	ret void			ret void
	}			}

	define void @test_zero_v64i8_align16(<64 x i8>* %dst) nounwind {			define void @test_zero_v64i8_align16(<64 x i8>* %dst) nounwind {
	; SSE-LABEL: test_zero_v64i8_align16:			; SSE-LABEL: test_zero_v64i8_align16:
	[18 unchanged lines hidden]
	; AVX512DQ-NEXT: vxorps %xmm0, %xmm0, %xmm0			; AVX512DQ-NEXT: vxorps %xmm0, %xmm0, %xmm0
	; AVX512DQ-NEXT: vmovups %ymm0, 32(%rdi)			; AVX512DQ-NEXT: vmovups %ymm0, 32(%rdi)
	; AVX512DQ-NEXT: vmovups %ymm0, (%rdi)			; AVX512DQ-NEXT: vmovups %ymm0, (%rdi)
	; AVX512DQ-NEXT: vzeroupper			; AVX512DQ-NEXT: vzeroupper
	; AVX512DQ-NEXT: retq			; AVX512DQ-NEXT: retq
	;			;
	; AVX512BW-LABEL: test_zero_v64i8_align16:			; AVX512BW-LABEL: test_zero_v64i8_align16:
	; AVX512BW: # %bb.0:			; AVX512BW: # %bb.0:
				; AVX512BW-NEXT: pushq %rbp
				; AVX512BW-NEXT: movq %rsp, %rbp
				; AVX512BW-NEXT: andq $-64, %rsp
				; AVX512BW-NEXT: subq $128, %rsp
	; AVX512BW-NEXT: vxorps %xmm0, %xmm0, %xmm0			; AVX512BW-NEXT: vxorps %xmm0, %xmm0, %xmm0
	; AVX512BW-NEXT: vmovups %zmm0, (%rdi)			; AVX512BW-NEXT: vmovaps %zmm0, (%rsp)
				; AVX512BW-NEXT: vmovaps (%rsp), %xmm0
				; AVX512BW-NEXT: vmovaps {{[0-9]+}}(%rsp), %xmm1
				; AVX512BW-NEXT: vmovaps {{[0-9]+}}(%rsp), %xmm2
				; AVX512BW-NEXT: vmovaps {{[0-9]+}}(%rsp), %xmm3
				; AVX512BW-NEXT: vmovntps %xmm3, 48(%rdi)
				; AVX512BW-NEXT: vmovntps %xmm2, 32(%rdi)
				; AVX512BW-NEXT: vmovntps %xmm1, 16(%rdi)
				; AVX512BW-NEXT: vmovntps %xmm0, (%rdi)
				; AVX512BW-NEXT: movq %rbp, %rsp
				; AVX512BW-NEXT: popq %rbp
	; AVX512BW-NEXT: vzeroupper			; AVX512BW-NEXT: vzeroupper
	; AVX512BW-NEXT: retq			; AVX512BW-NEXT: retq
	store <64 x i8> zeroinitializer, <64 x i8>* %dst, align 16, !nontemporal !1			store <64 x i8> zeroinitializer, <64 x i8>* %dst, align 16, !nontemporal !1
	ret void			ret void
	}			}

	define void @test_zero_v8f64_align32(<8 x double>* %dst) nounwind {			define void @test_zero_v8f64_align32(<8 x double>* %dst) nounwind {
	; SSE-LABEL: test_zero_v8f64_align32:			; SSE-LABEL: test_zero_v8f64_align32:
	[10 unchanged lines hidden]
	; AVX-NEXT: vxorps %xmm0, %xmm0, %xmm0			; AVX-NEXT: vxorps %xmm0, %xmm0, %xmm0
	; AVX-NEXT: vmovntps %ymm0, 32(%rdi)			; AVX-NEXT: vmovntps %ymm0, 32(%rdi)
	; AVX-NEXT: vmovntps %ymm0, (%rdi)			; AVX-NEXT: vmovntps %ymm0, (%rdi)
	; AVX-NEXT: vzeroupper			; AVX-NEXT: vzeroupper
	; AVX-NEXT: retq			; AVX-NEXT: retq
	;			;
	; AVX512-LABEL: test_zero_v8f64_align32:			; AVX512-LABEL: test_zero_v8f64_align32:
	; AVX512: # %bb.0:			; AVX512: # %bb.0:
	; AVX512-NEXT: vxorps %xmm0, %xmm0, %xmm0			; AVX512-NEXT: pushq %rbp
	; AVX512-NEXT: vmovups %zmm0, (%rdi)			; AVX512-NEXT: movq %rsp, %rbp
				; AVX512-NEXT: andq $-64, %rsp
				; AVX512-NEXT: subq $128, %rsp
				; AVX512-NEXT: vxorps %xmm0, %xmm0, %xmm0
				; AVX512-NEXT: vmovaps %zmm0, (%rsp)
				; AVX512-NEXT: vmovaps (%rsp), %ymm0
				; AVX512-NEXT: vmovaps {{[0-9]+}}(%rsp), %ymm1
				; AVX512-NEXT: vmovntps %ymm1, 32(%rdi)
				; AVX512-NEXT: vmovntps %ymm0, (%rdi)
				; AVX512-NEXT: movq %rbp, %rsp
				; AVX512-NEXT: popq %rbp
	; AVX512-NEXT: vzeroupper			; AVX512-NEXT: vzeroupper
	; AVX512-NEXT: retq			; AVX512-NEXT: retq
	store <8 x double> zeroinitializer, <8 x double>* %dst, align 32, !nontemporal !1			store <8 x double> zeroinitializer, <8 x double>* %dst, align 32, !nontemporal !1
	ret void			ret void
	}			}

	define void @test_zero_v16f32_align32(<16 x float>* %dst) nounwind {			define void @test_zero_v16f32_align32(<16 x float>* %dst) nounwind {
	; SSE-LABEL: test_zero_v16f32_align32:			; SSE-LABEL: test_zero_v16f32_align32:
	[10 unchanged lines hidden]
	; AVX-NEXT: vxorps %xmm0, %xmm0, %xmm0			; AVX-NEXT: vxorps %xmm0, %xmm0, %xmm0
	; AVX-NEXT: vmovntps %ymm0, 32(%rdi)			; AVX-NEXT: vmovntps %ymm0, 32(%rdi)
	; AVX-NEXT: vmovntps %ymm0, (%rdi)			; AVX-NEXT: vmovntps %ymm0, (%rdi)
	; AVX-NEXT: vzeroupper			; AVX-NEXT: vzeroupper
	; AVX-NEXT: retq			; AVX-NEXT: retq
	;			;
	; AVX512-LABEL: test_zero_v16f32_align32:			; AVX512-LABEL: test_zero_v16f32_align32:
	; AVX512: # %bb.0:			; AVX512: # %bb.0:
	; AVX512-NEXT: vxorps %xmm0, %xmm0, %xmm0			; AVX512-NEXT: pushq %rbp
	; AVX512-NEXT: vmovups %zmm0, (%rdi)			; AVX512-NEXT: movq %rsp, %rbp
				; AVX512-NEXT: andq $-64, %rsp
				; AVX512-NEXT: subq $128, %rsp
				; AVX512-NEXT: vxorps %xmm0, %xmm0, %xmm0
				; AVX512-NEXT: vmovaps %zmm0, (%rsp)
				; AVX512-NEXT: vmovaps (%rsp), %ymm0
				; AVX512-NEXT: vmovaps {{[0-9]+}}(%rsp), %ymm1
				; AVX512-NEXT: vmovntps %ymm1, 32(%rdi)
				; AVX512-NEXT: vmovntps %ymm0, (%rdi)
				; AVX512-NEXT: movq %rbp, %rsp
				; AVX512-NEXT: popq %rbp
	; AVX512-NEXT: vzeroupper			; AVX512-NEXT: vzeroupper
	; AVX512-NEXT: retq			; AVX512-NEXT: retq
	store <16 x float> zeroinitializer, <16 x float>* %dst, align 32, !nontemporal !1			store <16 x float> zeroinitializer, <16 x float>* %dst, align 32, !nontemporal !1
	ret void			ret void
	}			}

	define void @test_zero_v8i64_align32(<8 x i64>* %dst) nounwind {			define void @test_zero_v8i64_align32(<8 x i64>* %dst) nounwind {
	; SSE-LABEL: test_zero_v8i64_align32:			; SSE-LABEL: test_zero_v8i64_align32:
	[10 unchanged lines hidden]
	; AVX-NEXT: vxorps %xmm0, %xmm0, %xmm0			; AVX-NEXT: vxorps %xmm0, %xmm0, %xmm0
	; AVX-NEXT: vmovntps %ymm0, 32(%rdi)			; AVX-NEXT: vmovntps %ymm0, 32(%rdi)
	; AVX-NEXT: vmovntps %ymm0, (%rdi)			; AVX-NEXT: vmovntps %ymm0, (%rdi)
	; AVX-NEXT: vzeroupper			; AVX-NEXT: vzeroupper
	; AVX-NEXT: retq			; AVX-NEXT: retq
	;			;
	; AVX512-LABEL: test_zero_v8i64_align32:			; AVX512-LABEL: test_zero_v8i64_align32:
	; AVX512: # %bb.0:			; AVX512: # %bb.0:
	; AVX512-NEXT: vxorps %xmm0, %xmm0, %xmm0			; AVX512-NEXT: pushq %rbp
	; AVX512-NEXT: vmovups %zmm0, (%rdi)			; AVX512-NEXT: movq %rsp, %rbp
				; AVX512-NEXT: andq $-64, %rsp
				; AVX512-NEXT: subq $128, %rsp
				; AVX512-NEXT: vxorps %xmm0, %xmm0, %xmm0
				; AVX512-NEXT: vmovaps %zmm0, (%rsp)
				; AVX512-NEXT: vmovaps (%rsp), %ymm0
				; AVX512-NEXT: vmovaps {{[0-9]+}}(%rsp), %ymm1
				; AVX512-NEXT: vmovntps %ymm1, 32(%rdi)
				; AVX512-NEXT: vmovntps %ymm0, (%rdi)
				; AVX512-NEXT: movq %rbp, %rsp
				; AVX512-NEXT: popq %rbp
	; AVX512-NEXT: vzeroupper			; AVX512-NEXT: vzeroupper
	; AVX512-NEXT: retq			; AVX512-NEXT: retq
	store <8 x i64> zeroinitializer, <8 x i64>* %dst, align 32, !nontemporal !1			store <8 x i64> zeroinitializer, <8 x i64>* %dst, align 32, !nontemporal !1
	ret void			ret void
	}			}

	define void @test_zero_v16i32_align32(<16 x i32>* %dst) nounwind {			define void @test_zero_v16i32_align32(<16 x i32>* %dst) nounwind {
	; SSE-LABEL: test_zero_v16i32_align32:			; SSE-LABEL: test_zero_v16i32_align32:
	[10 unchanged lines hidden]
	; AVX-NEXT: vxorps %xmm0, %xmm0, %xmm0			; AVX-NEXT: vxorps %xmm0, %xmm0, %xmm0
	; AVX-NEXT: vmovntps %ymm0, 32(%rdi)			; AVX-NEXT: vmovntps %ymm0, 32(%rdi)
	; AVX-NEXT: vmovntps %ymm0, (%rdi)			; AVX-NEXT: vmovntps %ymm0, (%rdi)
	; AVX-NEXT: vzeroupper			; AVX-NEXT: vzeroupper
	; AVX-NEXT: retq			; AVX-NEXT: retq
	;			;
	; AVX512-LABEL: test_zero_v16i32_align32:			; AVX512-LABEL: test_zero_v16i32_align32:
	; AVX512: # %bb.0:			; AVX512: # %bb.0:
	; AVX512-NEXT: vxorps %xmm0, %xmm0, %xmm0			; AVX512-NEXT: pushq %rbp
	; AVX512-NEXT: vmovups %zmm0, (%rdi)			; AVX512-NEXT: movq %rsp, %rbp
				; AVX512-NEXT: andq $-64, %rsp
				; AVX512-NEXT: subq $128, %rsp
				; AVX512-NEXT: vxorps %xmm0, %xmm0, %xmm0
				; AVX512-NEXT: vmovaps %zmm0, (%rsp)
				; AVX512-NEXT: vmovaps (%rsp), %ymm0
				; AVX512-NEXT: vmovaps {{[0-9]+}}(%rsp), %ymm1
				; AVX512-NEXT: vmovntps %ymm1, 32(%rdi)
				; AVX512-NEXT: vmovntps %ymm0, (%rdi)
				; AVX512-NEXT: movq %rbp, %rsp
				; AVX512-NEXT: popq %rbp
	; AVX512-NEXT: vzeroupper			; AVX512-NEXT: vzeroupper
	; AVX512-NEXT: retq			; AVX512-NEXT: retq
	store <16 x i32> zeroinitializer, <16 x i32>* %dst, align 32, !nontemporal !1			store <16 x i32> zeroinitializer, <16 x i32>* %dst, align 32, !nontemporal !1
	ret void			ret void
	}			}

	define void @test_zero_v32i16_align32(<32 x i16>* %dst) nounwind {			define void @test_zero_v32i16_align32(<32 x i16>* %dst) nounwind {
	; SSE-LABEL: test_zero_v32i16_align32:			; SSE-LABEL: test_zero_v32i16_align32:
	[18 unchanged lines hidden]
	; AVX512DQ-NEXT: vxorps %xmm0, %xmm0, %xmm0			; AVX512DQ-NEXT: vxorps %xmm0, %xmm0, %xmm0
	; AVX512DQ-NEXT: vmovntps %ymm0, 32(%rdi)			; AVX512DQ-NEXT: vmovntps %ymm0, 32(%rdi)
	; AVX512DQ-NEXT: vmovntps %ymm0, (%rdi)			; AVX512DQ-NEXT: vmovntps %ymm0, (%rdi)
	; AVX512DQ-NEXT: vzeroupper			; AVX512DQ-NEXT: vzeroupper
	; AVX512DQ-NEXT: retq			; AVX512DQ-NEXT: retq
	;			;
	; AVX512BW-LABEL: test_zero_v32i16_align32:			; AVX512BW-LABEL: test_zero_v32i16_align32:
	; AVX512BW: # %bb.0:			; AVX512BW: # %bb.0:
				; AVX512BW-NEXT: pushq %rbp
				; AVX512BW-NEXT: movq %rsp, %rbp
				; AVX512BW-NEXT: andq $-64, %rsp
				; AVX512BW-NEXT: subq $128, %rsp
	; AVX512BW-NEXT: vxorps %xmm0, %xmm0, %xmm0			; AVX512BW-NEXT: vxorps %xmm0, %xmm0, %xmm0
	; AVX512BW-NEXT: vmovups %zmm0, (%rdi)			; AVX512BW-NEXT: vmovaps %zmm0, (%rsp)
				; AVX512BW-NEXT: vmovaps (%rsp), %ymm0
				; AVX512BW-NEXT: vmovaps {{[0-9]+}}(%rsp), %ymm1
				; AVX512BW-NEXT: vmovntps %ymm1, 32(%rdi)
				; AVX512BW-NEXT: vmovntps %ymm0, (%rdi)
				; AVX512BW-NEXT: movq %rbp, %rsp
				; AVX512BW-NEXT: popq %rbp
	; AVX512BW-NEXT: vzeroupper			; AVX512BW-NEXT: vzeroupper
	; AVX512BW-NEXT: retq			; AVX512BW-NEXT: retq
	store <32 x i16> zeroinitializer, <32 x i16>* %dst, align 32, !nontemporal !1			store <32 x i16> zeroinitializer, <32 x i16>* %dst, align 32, !nontemporal !1
	ret void			ret void
	}			}

	define void @test_zero_v64i8_align32(<64 x i8>* %dst) nounwind {			define void @test_zero_v64i8_align32(<64 x i8>* %dst) nounwind {
	; SSE-LABEL: test_zero_v64i8_align32:			; SSE-LABEL: test_zero_v64i8_align32:
	[18 unchanged lines hidden]
	; AVX512DQ-NEXT: vxorps %xmm0, %xmm0, %xmm0			; AVX512DQ-NEXT: vxorps %xmm0, %xmm0, %xmm0
	; AVX512DQ-NEXT: vmovntps %ymm0, 32(%rdi)			; AVX512DQ-NEXT: vmovntps %ymm0, 32(%rdi)
	; AVX512DQ-NEXT: vmovntps %ymm0, (%rdi)			; AVX512DQ-NEXT: vmovntps %ymm0, (%rdi)
	; AVX512DQ-NEXT: vzeroupper			; AVX512DQ-NEXT: vzeroupper
	; AVX512DQ-NEXT: retq			; AVX512DQ-NEXT: retq
	;			;
	; AVX512BW-LABEL: test_zero_v64i8_align32:			; AVX512BW-LABEL: test_zero_v64i8_align32:
	; AVX512BW: # %bb.0:			; AVX512BW: # %bb.0:
				; AVX512BW-NEXT: pushq %rbp
				; AVX512BW-NEXT: movq %rsp, %rbp
				; AVX512BW-NEXT: andq $-64, %rsp
				; AVX512BW-NEXT: subq $128, %rsp
	; AVX512BW-NEXT: vxorps %xmm0, %xmm0, %xmm0			; AVX512BW-NEXT: vxorps %xmm0, %xmm0, %xmm0
	; AVX512BW-NEXT: vmovups %zmm0, (%rdi)			; AVX512BW-NEXT: vmovaps %zmm0, (%rsp)
				; AVX512BW-NEXT: vmovaps (%rsp), %ymm0
				; AVX512BW-NEXT: vmovaps {{[0-9]+}}(%rsp), %ymm1
				; AVX512BW-NEXT: vmovntps %ymm1, 32(%rdi)
				; AVX512BW-NEXT: vmovntps %ymm0, (%rdi)
				; AVX512BW-NEXT: movq %rbp, %rsp
				; AVX512BW-NEXT: popq %rbp
	; AVX512BW-NEXT: vzeroupper			; AVX512BW-NEXT: vzeroupper
	; AVX512BW-NEXT: retq			; AVX512BW-NEXT: retq
	store <64 x i8> zeroinitializer, <64 x i8>* %dst, align 32, !nontemporal !1			store <64 x i8> zeroinitializer, <64 x i8>* %dst, align 32, !nontemporal !1
	ret void			ret void
	}			}

	!1 = !{i32 1}			!1 = !{i32 1}