This is an archive of the discontinued LLVM Phabricator instance.

[x86] use SSE/AVX ops for non-zero memsets (PR27100)
Closed, Public

Authored by spatel on Mar 29 2016, 11:50 AM.

Details

Summary

This is a one-line-of-code patch for:
https://llvm.org/bugs/show_bug.cgi?id=27100

The reasoning that I'm hoping will hold is that we shouldn't treat a memset operation differently from a memcpy at this level, because they have exactly the same load/store instruction type requirements.
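
As a concrete illustration of the kind of call in question (the function name, fill value, and size below are hypothetical, not taken from the bug report), consider a small fixed-size memset with a non-zero fill byte:

#include <cstring>

// Hypothetical example: a small, fixed-size memset with a non-zero fill
// byte. With this patch, targets with fast unaligned vector stores can
// lower the call as a few 16-byte (SSE) or 32-byte (AVX) stores, just as
// they already do for an equivalent memcpy.
void fill(char *p) {
  std::memset(p, 42, 64);
}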

But as the test cases show, there's some ugliness here:

  1. The i386 (Windows) test expands to use 32 stores instead of a 'rep stosl'. Is that better or worse? (I'm not sure why this change even happens yet.)
  2. The memset-2.ll tests look quite awkward in the way they splat the byte value into an XMM reg; imul isn't generally cheap.
  3. Why do the memset-nonzero.ll tests for an AVX1 target not use vbroadcast like the AVX2 target?
  4. Why does the machine scheduler reorder the DAG nodes? In all cases, we create those store nodes in low-to-high mem address order, but that's not how the machine instructions come out.

I don't think any of the above are big enough problems to prevent this patch from going in first, but we should improve those.

Event Timeline

spatel updated this revision to Diff 51958. (Mar 29 2016, 11:50 AM)
spatel retitled this revision to [x86] use SSE/AVX ops for non-zero memsets (PR27100).
spatel updated this object.
spatel added a subscriber: llvm-commits.
zansari edited edge metadata. (Mar 29 2016, 1:11 PM)
> The memset-2.ll tests look quite awkward in the way they splat the byte value into an XMM reg; imul isn't generally cheap.

This would be my biggest concern out of all of the ugliness listed above.
I agree that the imul is ugly, but so is all of the other extra code generated to broadcast the byte into an xmm register. I'm guessing that this is why the "zeromemset" guard was put there: specifically, to allow memsets with cheap immediates through.

It looks like the code that expands the memset is pretty inefficient. This is also what I see with a memset(m, v, 16):

movzbl	4(%esp), %ecx
movl	$16843009, %edx         # imm = 0x1010101
movl	%ecx, %eax
mull	%edx
movd	%eax, %xmm0
imull	$16843009, %ecx, %eax   # imm = 0x1010101
addl	%edx, %eax
movd	%eax, %xmm1
punpckldq	%xmm1, %xmm0    # xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
movq	%xmm0, a+8
movq	%xmm0, a

In addition to all the gackiness, notice that we're only doing 8-byte stores after all of that.
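
If I'm reading the sequence right, the mull/imull/addl trio is the i386 lowering of a single 64-bit multiply: the byte gets splatted by multiplying it by 0x0101010101010101, and the 8-byte result is then stored twice. A minimal sketch of that scalar computation (the helper names are mine, not anything from the compiler output):

#include <cstdint>
#include <cstring>

// Sketch of the scalar value the expansion above appears to compute before
// moving it into %xmm0 and storing it twice as 8-byte chunks.
static uint64_t splat8(uint8_t v) {
  // On i386 this 64-bit multiply lowers to the mull/imull/addl trio:
  //   low dword  = lo32(v * 0x01010101)
  //   high dword = hi32(v * 0x01010101) + lo32(v * 0x01010101)
  return static_cast<uint64_t>(v) * 0x0101010101010101ULL;
}

void store16(void *m, uint8_t v) {
  uint64_t pattern = splat8(v);
  std::memcpy(m, &pattern, 8);                           // movq %xmm0, a
  std::memcpy(static_cast<char *>(m) + 8, &pattern, 8);  // movq %xmm0, a+8
}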

I like the change, but any chance we could fix this issue before committing it? We should really only be generating a couple of shifts/ors and a shuffle, followed by full 16-byte stores.

Thanks,
Zia.

hjl.tools edited edge metadata. (Mar 29 2016, 1:16 PM)

For a 16-byte memset, we can generate:

	movzbl	4(%esp), %eax
	imull	$16843009, %eax, %eax
	movd	%eax, %xmm0
	movl	dst, %eax
	pshufd	$0, %xmm0, %xmm0
	movdqu	%xmm0, (%eax)
	ret

Oops. I meant

	movzbl	8(%esp), %eax
	imull	$16843009, %eax, %eax
	movd	%eax, %xmm0
	movl	4(%esp), %eax
	pshufd	$0, %xmm0, %xmm0
	movdqu	%xmm0, (%eax)
	ret

Depending on the cost of imull, you can avoid it with:

movzbl    4(%esp), %ecx
movl      %ecx, %eax
shll      $8, %eax
orl       %eax, %ecx
movl      %ecx, %edx
shll      $16, %edx
orl       %edx, %ecx
movd      %ecx, %xmm0
pshufd    $0, %xmm0, %xmm1
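
For reference, both sequences compute the same 32-bit splat: multiplying by 0x01010101 (the 16843009 immediate) replicates the byte into every byte lane, and the shift/or chain builds the same value in two doubling steps. A small sketch of the two equivalent scalar computations (the function names are illustrative, not from the patch):

#include <cstdint>

// Replicate the byte into all four byte lanes with one multiply;
// 0x01010101 is the 16843009 immediate used by imull above.
uint32_t splat_mul(uint8_t v) {
  return static_cast<uint32_t>(v) * 0x01010101u;
}

// The same value built from shifts and ors, matching the alternative
// sequence above: first v | (v << 8), then that value or'd with itself
// shifted left by 16.
uint32_t splat_shift_or(uint8_t v) {
  uint32_t x = v;
  x |= x << 8;   // shll $8 / orl
  x |= x << 16;  // shll $16 / orl
  return x;
}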

spatel added a comment.

> In addition to all the gackiness, notice that we're only doing 8-byte stores after all of that.
>
> I like the change, but any chance we could fix this issue before committing it? We should really only be generating a couple of shifts/ors and a shuffle, followed by full 16-byte stores.

Clearly, the one-line patch was too ambitious. :)

What we're seeing in some of these changes is that we're hitting what I hope is a weird corner case: a slow unaligned SSE store implementation (i.e., before SSE4.2) combined with a 32-bit OS. On second thought, maybe that's not so weird.

In any case, I will fix the patch to preserve that existing behavior. By just loosening the restriction on the non-zero memset for fast CPUs, we'll avoid the strange codegen and still get the benefits shown in PR27100.

spatel updated this revision to Diff 52003. (Mar 29 2016, 4:26 PM)
spatel edited edge metadata.

Patch updated:
Move the memset check down to the slow SSE case: this allows fast targets to take advantage of SSE/AVX instructions and prevents slow targets from stepping into a codegen sinkhole while trying to splat a byte into an XMM reg.
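
Roughly, the updated condition has this shape (a simplified sketch with made-up names; this is not the code from the diff, just a model of the decision described above):

// Illustrative sketch only: isMemset, isZeroMemset, hasSSE, and
// hasFastUnalignedVectorStores are hypothetical parameters, not the
// identifiers used in the actual patch.
bool shouldUseVectorOps(unsigned size, bool isMemset, bool isZeroMemset,
                        bool hasSSE, bool hasFastUnalignedVectorStores) {
  if (!hasSSE || size < 16)
    return false;
  if (hasFastUnalignedVectorStores)
    return true;  // fast targets: vectorize memcpy and any memset
  // Slow-unaligned-store targets keep the old restriction: only memcpy and
  // zero memsets take the vector path, so they never pay for the expensive
  // byte-splat shown earlier in the thread.
  return !isMemset || isZeroMemset;
}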

Note that, unlike the previous rev of the patch, all existing regression tests remain unchanged except for the tests that I added to model the request in PR27100.

We still have the questions of AVX1 codegen and unexpected machine scheduler behavior, but I think we can address those separately.

zansari accepted this revision. (Mar 30 2016, 2:10 PM)
zansari edited edge metadata.

Thanks, Sanjay. This LGTM.

This revision is now accepted and ready to land. (Mar 30 2016, 2:10 PM)
This revision was automatically updated to reflect the committed changes.