This is an archive of the discontinued LLVM Phabricator instance.

[x86] avoid intermediate splat for non-zero memsets (PR27100)
ClosedPublic

Authored by spatel on Mar 31 2016, 2:22 PM.

Details

Reviewers

RKSimon
zansari
hjl.tools

Commits

rGa05e0ff22376: [x86] avoid intermediate splat for non-zero memsets (PR27100)
rL265148: [x86] avoid intermediate splat for non-zero memsets (PR27100)

Summary

Follow-up to D18566 - where we noticed that an intermediate splat was being generated for memsets of non-zero chars.

That was because we told getMemsetStores() to use a 32-bit vector element type, and it happily obliged by producing that constant using an integer multiply.
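
To make the "integer multiply" concrete: a byte can be replicated across a wider integer by multiplying it with a constant whose every byte is 0x01. The sketch below is a plain C++ illustration of that arithmetic, not the code LLVM emits; the helper names `splatByte32` and `splatByte64` are hypothetical.

```cpp
#include <cstdint>

// Replicate one byte across a 32-bit integer: 0x2A -> 0x2A2A2A2A.
// 0x01010101 is 16843009, the immediate of the `imull` in the old AVX checks.
constexpr std::uint32_t splatByte32(std::uint8_t C) {
  return static_cast<std::uint32_t>(C) * UINT32_C(0x01010101);
}

// The same trick at 64 bits, as in the SSE2 checks (`movabsq` + `imulq`).
constexpr std::uint64_t splatByte64(std::uint8_t C) {
  return static_cast<std::uint64_t>(C) * UINT64_C(0x0101010101010101);
}

static_assert(splatByte32(42) == UINT32_C(0x2A2A2A2A),
              "byte replicated four times");
static_assert(splatByte64(42) == UINT64_C(3038287259199220266),
              "matches the movabsq immediate in the SSE2 checks");
```

With a byte element type, the value can be splatted as a vector directly, so this scalar step is not needed.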

The tests that were added in the last patch are now equivalent for AVX1 and AVX2 (no splats, just a vector load), but we have PR27141 to track that splat difference. In the new tests, the splat via shuffling looks ok to me, but there might be some room for improvement depending on uarch there.

Note that I didn't change the SSE1/2 paths in this patch. I will follow up with that patch next. This patch should resolve PR27100.

Event Timeline

spatel updated this revision to Diff 52287. Mar 31 2016, 2:22 PM

spatel retitled this revision to [x86] avoid intermediate splat for non-zero memsets (PR27100).

spatel updated this object.

spatel added reviewers: zansari, RKSimon, hjl.tools.

spatel added a subscriber: llvm-commits.

Herald added a subscriber: mcrosier. Mar 31 2016, 2:22 PM

LGTM - one comment

lib/Target/X86/X86ISelLowering.cpp
2035	This /is/ a legal type for AVX1 - the trouble is we can't do much with it.

This revision is now accepted and ready to land. Apr 1 2016, 4:12 AM

andreadb added a subscriber: andreadb. Apr 1 2016, 4:24 AM

Nice patch Sanjay.

test/CodeGen/X86/memset-nonzero.ll
94	I noticed that on AVX we now always generate a vmovaps to load a vector of constants. That's obviously fine. However, I wonder if a vbroadcastss would be more appropriate in this case as it would use a smaller constant (for code size only - in this example we would save 28 bytes).

RKSimon added inline comments. Apr 1 2016, 4:56 AM

test/CodeGen/X86/memset-nonzero.ll
94	This is what is being discussed on PR27141 - it's proving tricky to determine when the broadcast is worth it and when it will cause register pressure issues.

spatel added inline comments. Apr 1 2016, 7:25 AM

lib/Target/X86/X86ISelLowering.cpp
2035	Thanks - I'll fix that comment before committing.
test/CodeGen/X86/memset-nonzero.ll
94	The difference is actually on the AVX2 side; AVX1 was loading a vector already just with a different format (v4f32). I know we've discussed the splat load vs. vector load trade-off before. Let's follow up in PR27141, or I'll open another report where we can see the problem that Simon mentioned directly.
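
As a side note on the constants discussed in this thread: the old AVX1 check line printed the same 0x2A bytes as eight floats, which is where 1.511366e-13 comes from. The sketch below reinterprets that bit pattern and works out the constant-size difference andreadb mentions. It assumes a 32-bit IEEE-754 `float`, counts constant data only, and uses hypothetical names throughout.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Reinterpret a 32-bit pattern as a single-precision float, the way the
// old AVX1 check line printed the constant.
inline float bitsToFloat(std::uint32_t Bits) {
  static_assert(sizeof(float) == sizeof(Bits), "expects a 32-bit float");
  float F;
  std::memcpy(&F, &Bits, sizeof(F));
  return F;
}

// Constant data needed by each approach for one 256-bit (ymm) value.
constexpr std::size_t FullVectorBytes = 32;     // vmovaps of a whole vector
constexpr std::size_t BroadcastScalarBytes = 4; // vbroadcastss of one scalar
constexpr std::size_t BytesSaved = FullVectorBytes - BroadcastScalarBytes;

static_assert(BytesSaved == 28, "the 28 bytes mentioned in the review");
```

Calling `bitsToFloat(0x2A2A2A2A)` yields roughly 1.511366e-13, so the old and new AVX check lines describe identical bytes in memory.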

Closed by commit rL265148: [x86] avoid intermediate splat for non-zero memsets (PR27100) (authored by spatel). Apr 1 2016, 9:32 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path                                  Size
lib/Target/X86/X86ISelLowering.cpp    14 lines
test/CodeGen/X86/memset-nonzero.ll    179 lines

Diff 52287

lib/Target/X86/X86ISelLowering.cpp

@@ X86TargetLowering::getOptimalMemOpType(uint64_t Size, [...]
                                        bool MemcpyStrSrc,
                                        MachineFunction &MF) const {
   const Function *F = MF.getFunction();
   if (!F->hasFnAttribute(Attribute::NoImplicitFloat)) {
     if (Size >= 16 &&
         (!Subtarget.isUnalignedMem16Slow() ||
          ((DstAlign == 0 || DstAlign >= 16) &&
           (SrcAlign == 0 || SrcAlign >= 16)))) {
-      if (Size >= 32) {
       // FIXME: Check if unaligned 32-byte accesses are slow.
-      if (Subtarget.hasInt256())
-        return MVT::v8i32;
-      if (Subtarget.hasFp256())
-        return MVT::v8f32;
-      }
+      if (Size >= 32 && Subtarget.hasAVX()) {
+        // Although this isn't a legal type for AVX1, we'll let legalization
+        // and shuffle lowering produce the optimal codegen. If we choose
+        // an optimal type with a vector element larger than a byte, memset can
+        // get an intermediate splat (using an integer multiply) before we splat
+        // as a vector.
+        return MVT::v32i8;
+      }
       if (Subtarget.hasSSE2())
         return MVT::v4i32;
       if (Subtarget.hasSSE1())
         return MVT::v4f32;
     } else if ((!IsMemset || ZeroMemset) && !MemcpyStrSrc && Size >= 8 &&
                !Subtarget.is64Bit() && Subtarget.hasSSE2()) {
       // Do not use f64 to lower memcpy if source is string constant. It's
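
The vector-type choice after this patch can be modeled roughly as below. This is a simplified, self-contained sketch rather than the LLVM code: `VecType`, `Features`, and `pickVectorType` are hypothetical stand-ins for `MVT`, the subtarget queries, and the relevant branch of `getOptimalMemOpType`, and it assumes the outer checks (size of at least 16 bytes, alignment, no `NoImplicitFloat` attribute) have already passed.

```cpp
// Hypothetical stand-ins for the LLVM types; not the real API.
enum class VecType { v32i8, v4i32, v4f32, Other };

struct Features {
  bool HasAVX;
  bool HasSSE2;
  bool HasSSE1;
};

// Any AVX target with at least 32 bytes to write now gets a byte-element
// 256-bit type, so no 32-bit scalar splat is needed first. The SSE paths
// are unchanged. `Other` stands for falling through to the remaining
// logic, which is not modeled here.
constexpr VecType pickVectorType(unsigned long long Size, Features F) {
  return (Size >= 32 && F.HasAVX) ? VecType::v32i8
         : F.HasSSE2              ? VecType::v4i32
         : F.HasSSE1              ? VecType::v4f32
                                  : VecType::Other;
}
```

For example, `pickVectorType(32, Features{true, true, true})` selects `v32i8` for both AVX1 and AVX2 targets, whereas the old code distinguished them through `hasInt256()` and `hasFp256()`.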

test/CodeGen/X86/memset-nonzero.ll

	Show All 29 Lines
	; SSE2-LABEL: memset_32_nonzero_bytes:			; SSE2-LABEL: memset_32_nonzero_bytes:
	; SSE2: movabsq $3038287259199220266, %rax # imm = 0x2A2A2A2A2A2A2A2A			; SSE2: movabsq $3038287259199220266, %rax # imm = 0x2A2A2A2A2A2A2A2A
	; SSE2-NEXT: movq %rax, 24(%rdi)			; SSE2-NEXT: movq %rax, 24(%rdi)
	; SSE2-NEXT: movq %rax, 16(%rdi)			; SSE2-NEXT: movq %rax, 16(%rdi)
	; SSE2-NEXT: movq %rax, 8(%rdi)			; SSE2-NEXT: movq %rax, 8(%rdi)
	; SSE2-NEXT: movq %rax, (%rdi)			; SSE2-NEXT: movq %rax, (%rdi)
	; SSE2-NEXT: retq			; SSE2-NEXT: retq
	;			;
	; AVX1-LABEL: memset_32_nonzero_bytes:			; AVX-LABEL: memset_32_nonzero_bytes:
	; AVX1: vmovaps {{.*#+}} ymm0 = [1.511366e-13,1.511366e-13,1.511366e-13,1.511366e-13,1.511366e-13,1.511366e-13,1.511366e-13,1.511366e-13]			; AVX: vmovaps {{.*#+}} ymm0 = [42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42]
	; AVX1-NEXT: vmovups %ymm0, (%rdi)			; AVX-NEXT: vmovups %ymm0, (%rdi)
	; AVX1-NEXT: vzeroupper			; AVX-NEXT: vzeroupper
	; AVX1-NEXT: retq			; AVX-NEXT: retq
	;
	; AVX2-LABEL: memset_32_nonzero_bytes:
	; AVX2: vbroadcastss {{.*}}(%rip), %ymm0
	; AVX2-NEXT: vmovups %ymm0, (%rdi)
	; AVX2-NEXT: vzeroupper
	; AVX2-NEXT: retq
	;			;
	%call = tail call i8* @__memset_chk(i8* %x, i32 42, i64 32, i64 -1)			%call = tail call i8* @__memset_chk(i8* %x, i32 42, i64 32, i64 -1)
	ret void			ret void
	}			}

	define void @memset_64_nonzero_bytes(i8* %x) {			define void @memset_64_nonzero_bytes(i8* %x) {
	; SSE2-LABEL: memset_64_nonzero_bytes:			; SSE2-LABEL: memset_64_nonzero_bytes:
	; SSE2: movabsq $3038287259199220266, %rax # imm = 0x2A2A2A2A2A2A2A2A			; SSE2: movabsq $3038287259199220266, %rax # imm = 0x2A2A2A2A2A2A2A2A
	; SSE2-NEXT: movq %rax, 56(%rdi)			; SSE2-NEXT: movq %rax, 56(%rdi)
	; SSE2-NEXT: movq %rax, 48(%rdi)			; SSE2-NEXT: movq %rax, 48(%rdi)
	; SSE2-NEXT: movq %rax, 40(%rdi)			; SSE2-NEXT: movq %rax, 40(%rdi)
	; SSE2-NEXT: movq %rax, 32(%rdi)			; SSE2-NEXT: movq %rax, 32(%rdi)
	; SSE2-NEXT: movq %rax, 24(%rdi)			; SSE2-NEXT: movq %rax, 24(%rdi)
	; SSE2-NEXT: movq %rax, 16(%rdi)			; SSE2-NEXT: movq %rax, 16(%rdi)
	; SSE2-NEXT: movq %rax, 8(%rdi)			; SSE2-NEXT: movq %rax, 8(%rdi)
	; SSE2-NEXT: movq %rax, (%rdi)			; SSE2-NEXT: movq %rax, (%rdi)
	; SSE2-NEXT: retq			; SSE2-NEXT: retq
	;			;
	; AVX1-LABEL: memset_64_nonzero_bytes:			; AVX-LABEL: memset_64_nonzero_bytes:
	; AVX1: vmovaps {{.*#+}} ymm0 = [1.511366e-13,1.511366e-13,1.511366e-13,1.511366e-13,1.511366e-13,1.511366e-13,1.511366e-13,1.511366e-13]			; AVX: vmovaps {{.*#+}} ymm0 = [42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42]
	; AVX1-NEXT: vmovups %ymm0, 32(%rdi)			; AVX-NEXT: vmovups %ymm0, 32(%rdi)
	; AVX1-NEXT: vmovups %ymm0, (%rdi)			; AVX-NEXT: vmovups %ymm0, (%rdi)
	; AVX1-NEXT: vzeroupper			; AVX-NEXT: vzeroupper
	; AVX1-NEXT: retq			; AVX-NEXT: retq
	;
	; AVX2-LABEL: memset_64_nonzero_bytes:
	; AVX2: vbroadcastss {{.*}}(%rip), %ymm0
	; AVX2-NEXT: vmovups %ymm0, 32(%rdi)
	; AVX2-NEXT: vmovups %ymm0, (%rdi)
	; AVX2-NEXT: vzeroupper
	; AVX2-NEXT: retq
	;			;
	%call = tail call i8* @__memset_chk(i8* %x, i32 42, i64 64, i64 -1)			%call = tail call i8* @__memset_chk(i8* %x, i32 42, i64 64, i64 -1)
	ret void			ret void
	}			}

	define void @memset_128_nonzero_bytes(i8* %x) {			define void @memset_128_nonzero_bytes(i8* %x) {
	; SSE2-LABEL: memset_128_nonzero_bytes:			; SSE2-LABEL: memset_128_nonzero_bytes:
	; SSE2: movabsq $3038287259199220266, %rax # imm = 0x2A2A2A2A2A2A2A2A			; SSE2: movabsq $3038287259199220266, %rax # imm = 0x2A2A2A2A2A2A2A2A
	Show All 10 Lines
	; SSE2-NEXT: movq %rax, 40(%rdi)			; SSE2-NEXT: movq %rax, 40(%rdi)
	; SSE2-NEXT: movq %rax, 32(%rdi)			; SSE2-NEXT: movq %rax, 32(%rdi)
	; SSE2-NEXT: movq %rax, 24(%rdi)			; SSE2-NEXT: movq %rax, 24(%rdi)
	; SSE2-NEXT: movq %rax, 16(%rdi)			; SSE2-NEXT: movq %rax, 16(%rdi)
	; SSE2-NEXT: movq %rax, 8(%rdi)			; SSE2-NEXT: movq %rax, 8(%rdi)
	; SSE2-NEXT: movq %rax, (%rdi)			; SSE2-NEXT: movq %rax, (%rdi)
	; SSE2-NEXT: retq			; SSE2-NEXT: retq
	;			;
	; AVX1-LABEL: memset_128_nonzero_bytes:			; AVX-LABEL: memset_128_nonzero_bytes:
	; AVX1: vmovaps {{.*#+}} ymm0 = [1.511366e-13,1.511366e-13,1.511366e-13,1.511366e-13,1.511366e-13,1.511366e-13,1.511366e-13,1.511366e-13]			; AVX: vmovaps {{.*#+}} ymm0 = [42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42]
	; AVX1-NEXT: vmovups %ymm0, 96(%rdi)			; AVX-NEXT: vmovups %ymm0, 96(%rdi)
	; AVX1-NEXT: vmovups %ymm0, 64(%rdi)			; AVX-NEXT: vmovups %ymm0, 64(%rdi)
	; AVX1-NEXT: vmovups %ymm0, 32(%rdi)			; AVX-NEXT: vmovups %ymm0, 32(%rdi)
	; AVX1-NEXT: vmovups %ymm0, (%rdi)			; AVX-NEXT: vmovups %ymm0, (%rdi)
	; AVX1-NEXT: vzeroupper			; AVX-NEXT: vzeroupper
	; AVX1-NEXT: retq			; AVX-NEXT: retq
	;
	; AVX2-LABEL: memset_128_nonzero_bytes:
	; AVX2: vbroadcastss {{.*}}(%rip), %ymm0
	; AVX2-NEXT: vmovups %ymm0, 96(%rdi)
	; AVX2-NEXT: vmovups %ymm0, 64(%rdi)
	; AVX2-NEXT: vmovups %ymm0, 32(%rdi)
	; AVX2-NEXT: vmovups %ymm0, (%rdi)
	; AVX2-NEXT: vzeroupper
	; AVX2-NEXT: retq
	;			;
	%call = tail call i8* @__memset_chk(i8* %x, i32 42, i64 128, i64 -1)			%call = tail call i8* @__memset_chk(i8* %x, i32 42, i64 128, i64 -1)
	ret void			ret void
	}			}

	define void @memset_256_nonzero_bytes(i8* %x) {			define void @memset_256_nonzero_bytes(i8* %x) {
	; SSE2-LABEL: memset_256_nonzero_bytes:			; SSE2-LABEL: memset_256_nonzero_bytes:
	; SSE2: pushq %rax			; SSE2: pushq %rax
	; SSE2-NEXT: .Ltmp0:			; SSE2-NEXT: .Ltmp0:
	; SSE2-NEXT: .cfi_def_cfa_offset 16			; SSE2-NEXT: .cfi_def_cfa_offset 16
	; SSE2-NEXT: movl $42, %esi			; SSE2-NEXT: movl $42, %esi
	; SSE2-NEXT: movl $256, %edx # imm = 0x100			; SSE2-NEXT: movl $256, %edx # imm = 0x100
	; SSE2-NEXT: callq memset			; SSE2-NEXT: callq memset
	; SSE2-NEXT: popq %rax			; SSE2-NEXT: popq %rax
	; SSE2-NEXT: retq			; SSE2-NEXT: retq
	;			;
	; AVX1-LABEL: memset_256_nonzero_bytes:			; AVX-LABEL: memset_256_nonzero_bytes:
	; AVX1: vmovaps {{.*#+}} ymm0 = [1.511366e-13,1.511366e-13,1.511366e-13,1.511366e-13,1.511366e-13,1.511366e-13,1.511366e-13,1.511366e-13]			; AVX: vmovaps {{.*#+}} ymm0 = [42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42]
	; AVX1-NEXT: vmovups %ymm0, 224(%rdi)			; AVX-NEXT: vmovups %ymm0, 224(%rdi)
	; AVX1-NEXT: vmovups %ymm0, 192(%rdi)			; AVX-NEXT: vmovups %ymm0, 192(%rdi)
	; AVX1-NEXT: vmovups %ymm0, 160(%rdi)			; AVX-NEXT: vmovups %ymm0, 160(%rdi)
	; AVX1-NEXT: vmovups %ymm0, 128(%rdi)			; AVX-NEXT: vmovups %ymm0, 128(%rdi)
	; AVX1-NEXT: vmovups %ymm0, 96(%rdi)			; AVX-NEXT: vmovups %ymm0, 96(%rdi)
	; AVX1-NEXT: vmovups %ymm0, 64(%rdi)			; AVX-NEXT: vmovups %ymm0, 64(%rdi)
	; AVX1-NEXT: vmovups %ymm0, 32(%rdi)			; AVX-NEXT: vmovups %ymm0, 32(%rdi)
	; AVX1-NEXT: vmovups %ymm0, (%rdi)			; AVX-NEXT: vmovups %ymm0, (%rdi)
	; AVX1-NEXT: vzeroupper			; AVX-NEXT: vzeroupper
	; AVX1-NEXT: retq			; AVX-NEXT: retq
	;
	; AVX2-LABEL: memset_256_nonzero_bytes:
	; AVX2: vbroadcastss {{.*}}(%rip), %ymm0
	; AVX2-NEXT: vmovups %ymm0, 224(%rdi)
	; AVX2-NEXT: vmovups %ymm0, 192(%rdi)
	; AVX2-NEXT: vmovups %ymm0, 160(%rdi)
	; AVX2-NEXT: vmovups %ymm0, 128(%rdi)
	; AVX2-NEXT: vmovups %ymm0, 96(%rdi)
	; AVX2-NEXT: vmovups %ymm0, 64(%rdi)
	; AVX2-NEXT: vmovups %ymm0, 32(%rdi)
	; AVX2-NEXT: vmovups %ymm0, (%rdi)
	; AVX2-NEXT: vzeroupper
	; AVX2-NEXT: retq
	;			;
	%call = tail call i8* @__memset_chk(i8* %x, i32 42, i64 256, i64 -1)			%call = tail call i8* @__memset_chk(i8* %x, i32 42, i64 256, i64 -1)
	ret void			ret void
	}			}

	declare i8* @__memset_chk(i8*, i32, i64, i64)			declare i8* @__memset_chk(i8*, i32, i64, i64)

	; Repeat with a non-constant value for the stores.			; Repeat with a non-constant value for the stores.
	Show All 34 Lines
	; SSE2-NEXT: imulq %rax, %rcx			; SSE2-NEXT: imulq %rax, %rcx
	; SSE2-NEXT: movq %rcx, 24(%rdi)			; SSE2-NEXT: movq %rcx, 24(%rdi)
	; SSE2-NEXT: movq %rcx, 16(%rdi)			; SSE2-NEXT: movq %rcx, 16(%rdi)
	; SSE2-NEXT: movq %rcx, 8(%rdi)			; SSE2-NEXT: movq %rcx, 8(%rdi)
	; SSE2-NEXT: movq %rcx, (%rdi)			; SSE2-NEXT: movq %rcx, (%rdi)
	; SSE2-NEXT: retq			; SSE2-NEXT: retq
	;			;
	; AVX1-LABEL: memset_32_nonconst_bytes:			; AVX1-LABEL: memset_32_nonconst_bytes:
	; AVX1: movzbl %sil, %eax			; AVX1: vmovd %esi, %xmm0
	; AVX1-NEXT: imull $16843009, %eax, %eax # imm = 0x1010101			; AVX1-NEXT: vpxor %xmm1, %xmm1, %xmm1
	; AVX1-NEXT: vmovd %eax, %xmm0			; AVX1-NEXT: vpshufb %xmm1, %xmm0, %xmm0
	; AVX1-NEXT: vpermilps {{.*#+}} xmm0 = xmm0[0,0,0,0]
	; AVX1-NEXT: vinsertf128 $1, %xmm0, %ymm0, %ymm0			; AVX1-NEXT: vinsertf128 $1, %xmm0, %ymm0, %ymm0
	; AVX1-NEXT: vmovups %ymm0, (%rdi)			; AVX1-NEXT: vmovups %ymm0, (%rdi)
	; AVX1-NEXT: vzeroupper			; AVX1-NEXT: vzeroupper
	; AVX1-NEXT: retq			; AVX1-NEXT: retq
	;			;
	; AVX2-LABEL: memset_32_nonconst_bytes:			; AVX2-LABEL: memset_32_nonconst_bytes:
	; AVX2: movzbl %sil, %eax			; AVX2: vmovd %esi, %xmm0
	; AVX2-NEXT: imull $16843009, %eax, %eax # imm = 0x1010101			; AVX2-NEXT: vpbroadcastb %xmm0, %ymm0
	; AVX2-NEXT: vmovd %eax, %xmm0			; AVX2-NEXT: vmovdqu %ymm0, (%rdi)
	; AVX2-NEXT: vbroadcastss %xmm0, %ymm0
	; AVX2-NEXT: vmovups %ymm0, (%rdi)
	; AVX2-NEXT: vzeroupper			; AVX2-NEXT: vzeroupper
	; AVX2-NEXT: retq			; AVX2-NEXT: retq
	;			;
	tail call void @llvm.memset.p0i8.i64(i8* %x, i8 %c, i64 32, i32 1, i1 false)			tail call void @llvm.memset.p0i8.i64(i8* %x, i8 %c, i64 32, i32 1, i1 false)
	ret void			ret void
	}			}

	define void @memset_64_nonconst_bytes(i8* %x, i8 %c) {			define void @memset_64_nonconst_bytes(i8* %x, i8 %c) {
	; SSE2-LABEL: memset_64_nonconst_bytes:			; SSE2-LABEL: memset_64_nonconst_bytes:
	; SSE2: movzbl %sil, %eax			; SSE2: movzbl %sil, %eax
	; SSE2-NEXT: movabsq $72340172838076673, %rcx # imm = 0x101010101010101			; SSE2-NEXT: movabsq $72340172838076673, %rcx # imm = 0x101010101010101
	; SSE2-NEXT: imulq %rax, %rcx			; SSE2-NEXT: imulq %rax, %rcx
	; SSE2-NEXT: movq %rcx, 56(%rdi)			; SSE2-NEXT: movq %rcx, 56(%rdi)
	; SSE2-NEXT: movq %rcx, 48(%rdi)			; SSE2-NEXT: movq %rcx, 48(%rdi)
	; SSE2-NEXT: movq %rcx, 40(%rdi)			; SSE2-NEXT: movq %rcx, 40(%rdi)
	; SSE2-NEXT: movq %rcx, 32(%rdi)			; SSE2-NEXT: movq %rcx, 32(%rdi)
	; SSE2-NEXT: movq %rcx, 24(%rdi)			; SSE2-NEXT: movq %rcx, 24(%rdi)
	; SSE2-NEXT: movq %rcx, 16(%rdi)			; SSE2-NEXT: movq %rcx, 16(%rdi)
	; SSE2-NEXT: movq %rcx, 8(%rdi)			; SSE2-NEXT: movq %rcx, 8(%rdi)
	; SSE2-NEXT: movq %rcx, (%rdi)			; SSE2-NEXT: movq %rcx, (%rdi)
	; SSE2-NEXT: retq			; SSE2-NEXT: retq
	;			;
	; AVX1-LABEL: memset_64_nonconst_bytes:			; AVX1-LABEL: memset_64_nonconst_bytes:
	; AVX1: movzbl %sil, %eax			; AVX1: vmovd %esi, %xmm0
	; AVX1-NEXT: imull $16843009, %eax, %eax # imm = 0x1010101			; AVX1-NEXT: vpxor %xmm1, %xmm1, %xmm1
	; AVX1-NEXT: vmovd %eax, %xmm0			; AVX1-NEXT: vpshufb %xmm1, %xmm0, %xmm0
	; AVX1-NEXT: vpermilps {{.*#+}} xmm0 = xmm0[0,0,0,0]
	; AVX1-NEXT: vinsertf128 $1, %xmm0, %ymm0, %ymm0			; AVX1-NEXT: vinsertf128 $1, %xmm0, %ymm0, %ymm0
	; AVX1-NEXT: vmovups %ymm0, 32(%rdi)			; AVX1-NEXT: vmovups %ymm0, 32(%rdi)
	; AVX1-NEXT: vmovups %ymm0, (%rdi)			; AVX1-NEXT: vmovups %ymm0, (%rdi)
	; AVX1-NEXT: vzeroupper			; AVX1-NEXT: vzeroupper
	; AVX1-NEXT: retq			; AVX1-NEXT: retq
	;			;
	; AVX2-LABEL: memset_64_nonconst_bytes:			; AVX2-LABEL: memset_64_nonconst_bytes:
	; AVX2: movzbl %sil, %eax			; AVX2: vmovd %esi, %xmm0
	; AVX2-NEXT: imull $16843009, %eax, %eax # imm = 0x1010101			; AVX2-NEXT: vpbroadcastb %xmm0, %ymm0
	; AVX2-NEXT: vmovd %eax, %xmm0			; AVX2-NEXT: vmovdqu %ymm0, 32(%rdi)
	; AVX2-NEXT: vbroadcastss %xmm0, %ymm0			; AVX2-NEXT: vmovdqu %ymm0, (%rdi)
	; AVX2-NEXT: vmovups %ymm0, 32(%rdi)
	; AVX2-NEXT: vmovups %ymm0, (%rdi)
	; AVX2-NEXT: vzeroupper			; AVX2-NEXT: vzeroupper
	; AVX2-NEXT: retq			; AVX2-NEXT: retq
	;			;
	tail call void @llvm.memset.p0i8.i64(i8* %x, i8 %c, i64 64, i32 1, i1 false)			tail call void @llvm.memset.p0i8.i64(i8* %x, i8 %c, i64 64, i32 1, i1 false)
	ret void			ret void
	}			}

	define void @memset_128_nonconst_bytes(i8* %x, i8 %c) {			define void @memset_128_nonconst_bytes(i8* %x, i8 %c) {
	Show All 15 Lines
	; SSE2-NEXT: movq %rcx, 32(%rdi)			; SSE2-NEXT: movq %rcx, 32(%rdi)
	; SSE2-NEXT: movq %rcx, 24(%rdi)			; SSE2-NEXT: movq %rcx, 24(%rdi)
	; SSE2-NEXT: movq %rcx, 16(%rdi)			; SSE2-NEXT: movq %rcx, 16(%rdi)
	; SSE2-NEXT: movq %rcx, 8(%rdi)			; SSE2-NEXT: movq %rcx, 8(%rdi)
	; SSE2-NEXT: movq %rcx, (%rdi)			; SSE2-NEXT: movq %rcx, (%rdi)
	; SSE2-NEXT: retq			; SSE2-NEXT: retq
	;			;
	; AVX1-LABEL: memset_128_nonconst_bytes:			; AVX1-LABEL: memset_128_nonconst_bytes:
	; AVX1: movzbl %sil, %eax			; AVX1: vmovd %esi, %xmm0
	; AVX1-NEXT: imull $16843009, %eax, %eax # imm = 0x1010101			; AVX1-NEXT: vpxor %xmm1, %xmm1, %xmm1
	; AVX1-NEXT: vmovd %eax, %xmm0			; AVX1-NEXT: vpshufb %xmm1, %xmm0, %xmm0
	; AVX1-NEXT: vpermilps {{.*#+}} xmm0 = xmm0[0,0,0,0]
	; AVX1-NEXT: vinsertf128 $1, %xmm0, %ymm0, %ymm0			; AVX1-NEXT: vinsertf128 $1, %xmm0, %ymm0, %ymm0
	; AVX1-NEXT: vmovups %ymm0, 96(%rdi)			; AVX1-NEXT: vmovups %ymm0, 96(%rdi)
	; AVX1-NEXT: vmovups %ymm0, 64(%rdi)			; AVX1-NEXT: vmovups %ymm0, 64(%rdi)
	; AVX1-NEXT: vmovups %ymm0, 32(%rdi)			; AVX1-NEXT: vmovups %ymm0, 32(%rdi)
	; AVX1-NEXT: vmovups %ymm0, (%rdi)			; AVX1-NEXT: vmovups %ymm0, (%rdi)
	; AVX1-NEXT: vzeroupper			; AVX1-NEXT: vzeroupper
	; AVX1-NEXT: retq			; AVX1-NEXT: retq
	;			;
	; AVX2-LABEL: memset_128_nonconst_bytes:			; AVX2-LABEL: memset_128_nonconst_bytes:
	; AVX2: movzbl %sil, %eax			; AVX2: vmovd %esi, %xmm0
	; AVX2-NEXT: imull $16843009, %eax, %eax # imm = 0x1010101			; AVX2-NEXT: vpbroadcastb %xmm0, %ymm0
	; AVX2-NEXT: vmovd %eax, %xmm0			; AVX2-NEXT: vmovdqu %ymm0, 96(%rdi)
	; AVX2-NEXT: vbroadcastss %xmm0, %ymm0			; AVX2-NEXT: vmovdqu %ymm0, 64(%rdi)
	; AVX2-NEXT: vmovups %ymm0, 96(%rdi)			; AVX2-NEXT: vmovdqu %ymm0, 32(%rdi)
	; AVX2-NEXT: vmovups %ymm0, 64(%rdi)			; AVX2-NEXT: vmovdqu %ymm0, (%rdi)
	; AVX2-NEXT: vmovups %ymm0, 32(%rdi)
	; AVX2-NEXT: vmovups %ymm0, (%rdi)
	; AVX2-NEXT: vzeroupper			; AVX2-NEXT: vzeroupper
	; AVX2-NEXT: retq			; AVX2-NEXT: retq
	;			;
	tail call void @llvm.memset.p0i8.i64(i8* %x, i8 %c, i64 128, i32 1, i1 false)			tail call void @llvm.memset.p0i8.i64(i8* %x, i8 %c, i64 128, i32 1, i1 false)
	ret void			ret void
	}			}

	define void @memset_256_nonconst_bytes(i8* %x, i8 %c) {			define void @memset_256_nonconst_bytes(i8* %x, i8 %c) {
	; SSE2-LABEL: memset_256_nonconst_bytes:			; SSE2-LABEL: memset_256_nonconst_bytes:
	; SSE2: movl $256, %edx # imm = 0x100			; SSE2: movl $256, %edx # imm = 0x100
	; SSE2-NEXT: jmp memset # TAILCALL			; SSE2-NEXT: jmp memset # TAILCALL
	;			;
	; AVX1-LABEL: memset_256_nonconst_bytes:			; AVX1-LABEL: memset_256_nonconst_bytes:
	; AVX1: movzbl %sil, %eax			; AVX1: vmovd %esi, %xmm0
	; AVX1-NEXT: imull $16843009, %eax, %eax # imm = 0x1010101			; AVX1-NEXT: vpxor %xmm1, %xmm1, %xmm1
	; AVX1-NEXT: vmovd %eax, %xmm0			; AVX1-NEXT: vpshufb %xmm1, %xmm0, %xmm0
	; AVX1-NEXT: vpermilps {{.*#+}} xmm0 = xmm0[0,0,0,0]
	; AVX1-NEXT: vinsertf128 $1, %xmm0, %ymm0, %ymm0			; AVX1-NEXT: vinsertf128 $1, %xmm0, %ymm0, %ymm0
	; AVX1-NEXT: vmovups %ymm0, 224(%rdi)			; AVX1-NEXT: vmovups %ymm0, 224(%rdi)
	; AVX1-NEXT: vmovups %ymm0, 192(%rdi)			; AVX1-NEXT: vmovups %ymm0, 192(%rdi)
	; AVX1-NEXT: vmovups %ymm0, 160(%rdi)			; AVX1-NEXT: vmovups %ymm0, 160(%rdi)
	; AVX1-NEXT: vmovups %ymm0, 128(%rdi)			; AVX1-NEXT: vmovups %ymm0, 128(%rdi)
	; AVX1-NEXT: vmovups %ymm0, 96(%rdi)			; AVX1-NEXT: vmovups %ymm0, 96(%rdi)
	; AVX1-NEXT: vmovups %ymm0, 64(%rdi)			; AVX1-NEXT: vmovups %ymm0, 64(%rdi)
	; AVX1-NEXT: vmovups %ymm0, 32(%rdi)			; AVX1-NEXT: vmovups %ymm0, 32(%rdi)
	; AVX1-NEXT: vmovups %ymm0, (%rdi)			; AVX1-NEXT: vmovups %ymm0, (%rdi)
	; AVX1-NEXT: vzeroupper			; AVX1-NEXT: vzeroupper
	; AVX1-NEXT: retq			; AVX1-NEXT: retq
	;			;
	; AVX2-LABEL: memset_256_nonconst_bytes:			; AVX2-LABEL: memset_256_nonconst_bytes:
	; AVX2: movzbl %sil, %eax			; AVX2: vmovd %esi, %xmm0
	; AVX2-NEXT: imull $16843009, %eax, %eax # imm = 0x1010101			; AVX2-NEXT: vpbroadcastb %xmm0, %ymm0
	; AVX2-NEXT: vmovd %eax, %xmm0			; AVX2-NEXT: vmovdqu %ymm0, 224(%rdi)
	; AVX2-NEXT: vbroadcastss %xmm0, %ymm0			; AVX2-NEXT: vmovdqu %ymm0, 192(%rdi)
	; AVX2-NEXT: vmovups %ymm0, 224(%rdi)			; AVX2-NEXT: vmovdqu %ymm0, 160(%rdi)
	; AVX2-NEXT: vmovups %ymm0, 192(%rdi)			; AVX2-NEXT: vmovdqu %ymm0, 128(%rdi)
	; AVX2-NEXT: vmovups %ymm0, 160(%rdi)			; AVX2-NEXT: vmovdqu %ymm0, 96(%rdi)
	; AVX2-NEXT: vmovups %ymm0, 128(%rdi)			; AVX2-NEXT: vmovdqu %ymm0, 64(%rdi)
	; AVX2-NEXT: vmovups %ymm0, 96(%rdi)			; AVX2-NEXT: vmovdqu %ymm0, 32(%rdi)
	; AVX2-NEXT: vmovups %ymm0, 64(%rdi)			; AVX2-NEXT: vmovdqu %ymm0, (%rdi)
	; AVX2-NEXT: vmovups %ymm0, 32(%rdi)
	; AVX2-NEXT: vmovups %ymm0, (%rdi)
	; AVX2-NEXT: vzeroupper			; AVX2-NEXT: vzeroupper
	; AVX2-NEXT: retq			; AVX2-NEXT: retq
	;			;
	tail call void @llvm.memset.p0i8.i64(i8* %x, i8 %c, i64 256, i32 1, i1 false)			tail call void @llvm.memset.p0i8.i64(i8* %x, i8 %c, i64 256, i32 1, i1 false)
	ret void			ret void
	}			}

	declare void @llvm.memset.p0i8.i64(i8* nocapture, i8, i64, i32, i1) #1			declare void @llvm.memset.p0i8.i64(i8* nocapture, i8, i64, i32, i1) #1