This is an archive of the discontinued LLVM Phabricator instance.

[x86] use SSE/AVX ops for non-zero memsets (PR27100)
ClosedPublic

Authored by spatel on Mar 29 2016, 11:50 AM.

Details

Reviewers

qcolombet
RKSimon
MatzeB
kbsmith1
zansari
hjl.tools
DavidKreitzer

Commits

rG92d5ea5e07bf: [x86] use SSE/AVX ops for non-zero memsets (PR27100)
rL265029: [x86] use SSE/AVX ops for non-zero memsets (PR27100)

Summary

This is a one-line-of-code patch for:
https://llvm.org/bugs/show_bug.cgi?id=27100

The reasoning that I'm hoping will hold is that we shouldn't discriminate a memset operation from a memcpy at this level because they have exactly the same load/store instruction type requirements.

But as the test cases show, there's some ugliness here:

The i386 (Windows) test expands to use 32 stores instead of a 'rep stosl'. Is that better or worse? (I'm not sure why this change even happens yet.)
The memset-2.ll tests look quite awkward in the way they splat the byte value into an XMM reg; imul isn't generally cheap.
Why do the memset-nonzero.ll tests for an AVX1 target not use vbroadcast like the AVX2 target?
Why does the machine scheduler reorder the DAG nodes? In all cases, we create those store nodes in low-to-high mem address order, but that's not how the machine instructions come out.

I don't think any of the above are big enough problems to prevent this patch from going in first, but we should improve those.

Diff Detail

Repository: rL LLVM

Event Timeline

spatel updated this revision to Diff 51958. Mar 29 2016, 11:50 AM

spatel retitled this revision to [x86] use SSE/AVX ops for non-zero memsets (PR27100).

spatel updated this object.

spatel added reviewers: zansari, kbsmith1, hjl.tools, DavidKreitzer, RKSimon, qcolombet, MatzeB.

spatel added a subscriber: llvm-commits.

Herald added a subscriber: mcrosier. Mar 29 2016, 11:50 AM

> The memset-2.ll tests look quite awkward in the way they splat the byte value into an XMM reg; imul isn't generally cheap.

This would be my biggest concern out of all of the other ugliness.
I agree that the imul is ugly, but so is all of the other extra code generated to broadcast the byte into an xmm. I'm guessing that this is why the "zeromemset" guard was put there, specifically, to allow memsets with cheap immediates through.

It looks like the code that expands the memset is pretty inefficient. This is also what I see with a memset(m, v, 16) :

movzbl	4(%esp), %ecx
movl	$16843009, %edx         # imm = 0x1010101
movl	%ecx, %eax
mull	%edx
movd	%eax, %xmm0
imull	$16843009, %ecx, %eax   # imm = 0x1010101
addl	%edx, %eax
movd	%eax, %xmm1
punpckldq	%xmm1, %xmm0    # xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
movq	%xmm0, a+8
movq	%xmm0, a

.. in addition to all the gackiness, also notice that we're only doing 8B stores after all of that.

I like the change, but any chance we could fix this issue before committing this change? We should really only be generating a couple of shifts/ors and a shuffle, followed by full 16B stores.

Thanks,
Zia.

For 16-byte memset, we can generate

	movzbl	4(%esp), %eax
	imull	$16843009, %eax, %eax
	movd	%eax, %xmm0
	movl	dst, %eax
	pshufd	$0, %xmm0, %xmm0
	movdqu	%xmm0, (%eax)
	ret

Oops. I meant

	movzbl	8(%esp), %eax
	imull	$16843009, %eax, %eax
	movd	%eax, %xmm0
	movl	4(%esp), %eax
	pshufd	$0, %xmm0, %xmm0
	movdqu	%xmm0, (%eax)
	ret

Depending on the cost of imull, you can avoid it with:

movzbl    4(%esp), %ecx
movl      %ecx, %eax
shll      $8, %eax
orl       %eax, %ecx
movl      %ecx, %edx
shll      $16, %edx
orl       %edx, %ecx
movd      %ecx, %xmm0
pshufd    $0, %xmm0, %xmm1
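
The two scalar splat idioms in the sequences above (the imull by 0x01010101 and the shll/orl chain) should produce the same 32-bit value before the pshufd broadcast. A minimal C++ sketch of that equivalence follows; the function names are hypothetical and exist only for illustration, and this models the arithmetic only, not the instruction selection.

```cpp
#include <cassert>
#include <cstdint>

// Splat a byte into all four bytes of a 32-bit value with one multiply,
// mirroring the "imull $16843009" (0x01010101) sequence.
inline std::uint32_t splatByMultiply(std::uint8_t b) {
  return static_cast<std::uint32_t>(b) * 0x01010101u;
}

// Same result with two shift/or steps, mirroring the shll/orl sequence.
inline std::uint32_t splatByShiftOr(std::uint8_t b) {
  std::uint32_t v = b;
  v |= v << 8;   // byte now occupies the low two bytes
  v |= v << 16;  // byte now occupies all four bytes
  return v;
}

// Exhaustively confirm that both forms agree for every byte value.
inline void checkSplatEquivalence() {
  for (unsigned b = 0; b < 256; ++b)
    assert(splatByMultiply(static_cast<std::uint8_t>(b)) ==
           splatByShiftOr(static_cast<std::uint8_t>(b)));
}
```

Which form is cheaper depends on the target's multiply latency, which is the trade-off raised in the comment above.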

In D18566#386084, @zansari wrote:

> .. in addition to all the gackiness, also notice that we're only doing 8B stores after all of that.
>
> I like the change, but any chance we could fix this issue before committing this change? We should really only be generating a couple of shifts/ors and a shuffle, followed by full 16B stores.

Clearly, the one-line patch was too ambitious. :)

What we're seeing in some of these changes is that we're hitting what I hope is a weird corner case: a slow unaligned SSE store implementation (i.e., before SSE4.2) with a 32-bit OS. On second thought, maybe that's not so weird.

In any case, I will fix the patch to preserve that existing behavior. By just loosening the restriction on the non-zero memset for fast CPUs, we'll avoid the strange codegen and still get the benefits shown in PR27100.

Patch updated:
Move the memset check down to the slow SSE case: this allows fast targets to take advantage of SSE/AVX instructions and prevents slow targets from stepping into a codegen sinkhole while trying to splat a byte into an XMM reg.

Note that, unlike the previous rev of the patch, all existing regression tests remain unchanged except for the tests that I added to model the request in PR27100.

We still have the questions of AVX1 codegen and unexpected machine scheduler behavior, but I think we can address those separately.
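
To make the effect of moving the check concrete, here is a simplified, standalone C++ model of the updated decision logic. It is a sketch only: SubtargetModel, MemOpType, and pickMemOpType are hypothetical stand-ins for X86Subtarget, MVT, and getOptimalMemOpType, the NoImplicitFloat flag is passed as a plain bool, and the final i32 fallback is an assumption because that part of the function is not shown in the diff.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the MVT values returned by the real function.
enum class MemOpType { V8I32, V8F32, V4I32, V4F32, F64, I64, I32 };

// Hypothetical stand-in for the X86Subtarget queries used by the patch.
struct SubtargetModel {
  bool Is64Bit, HasSSE1, HasSSE2, HasFp256, HasInt256, UnalignedMem16Slow;
};

inline MemOpType pickMemOpType(std::uint64_t Size, unsigned DstAlign,
                               unsigned SrcAlign, bool IsMemset,
                               bool ZeroMemset, bool MemcpyStrSrc,
                               bool NoImplicitFloat,
                               const SubtargetModel &ST) {
  if (!NoImplicitFloat) {
    // Fast path: vector stores are allowed for any memset value.
    if (Size >= 16 &&
        (!ST.UnalignedMem16Slow ||
         ((DstAlign == 0 || DstAlign >= 16) &&
          (SrcAlign == 0 || SrcAlign >= 16)))) {
      if (Size >= 32) {
        if (ST.HasInt256) return MemOpType::V8I32;
        if (ST.HasFp256) return MemOpType::V8F32;
      }
      if (ST.HasSSE2) return MemOpType::V4I32;
      if (ST.HasSSE1) return MemOpType::V4F32;
    } else if ((!IsMemset || ZeroMemset) && !MemcpyStrSrc && Size >= 8 &&
               !ST.Is64Bit && ST.HasSSE2) {
      // Slow path: the memset restriction now lives only here.
      return MemOpType::F64;
    }
  }
  if (ST.Is64Bit && Size >= 8) return MemOpType::I64;
  return MemOpType::I32;  // assumed fallback, not shown in the diff
}

// Cases discussed in the review, with assumed feature sets.
inline void checkReviewCases() {
  SubtargetModel AVX2{true, true, true, true, true, false};
  SubtargetModel Slow32{false, true, true, false, false, true};
  // Non-zero 32-byte memset on a fast AVX2 target now gets 256-bit stores.
  assert(pickMemOpType(32, 1, 0, true, false, false, false, AVX2) ==
         MemOpType::V8I32);
  // Non-zero memset on a slow 32-bit SSE2 target still avoids the f64 path.
  assert(pickMemOpType(16, 1, 0, true, false, false, false, Slow32) ==
         MemOpType::I32);
}
```

Under these assumptions, only the targets with fast unaligned 16-byte accesses change behavior for non-zero memsets, which matches the claim that the existing regression tests stay unchanged.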

Thanks, Sanjay. This LGTM.

This revision is now accepted and ready to land. Mar 30 2016, 2:10 PM

Closed by commit rL265029: [x86] use SSE/AVX ops for non-zero memsets (PR27100) (authored by spatel). Mar 31 2016, 10:35 AM

This revision was automatically updated to reflect the committed changes.

spatel mentioned this in D18676: [x86] avoid intermediate splat for non-zero memsets (PR27100). Mar 31 2016, 2:22 PM

spatel mentioned this in rL265148: [x86] avoid intermediate splat for non-zero memsets (PR27100). Apr 1 2016, 9:32 AM

spatel mentioned this in rL265161: [x86] avoid intermediate splat for non-zero memsets (PR27100). Apr 1 2016, 10:42 AM

Revision Contents

Path                                             Size
llvm/trunk/lib/Target/X86/X86ISelLowering.cpp    12 lines
llvm/trunk/test/CodeGen/X86/memset-nonzero.ll    182 lines

Diff 52240

llvm/trunk/lib/Target/X86/X86ISelLowering.cpp

 /// target-independent logic.
 EVT
 X86TargetLowering::getOptimalMemOpType(uint64_t Size,
                                        unsigned DstAlign, unsigned SrcAlign,
                                        bool IsMemset, bool ZeroMemset,
                                        bool MemcpyStrSrc,
                                        MachineFunction &MF) const {
   const Function *F = MF.getFunction();
-  if ((!IsMemset || ZeroMemset) &&
-      !F->hasFnAttribute(Attribute::NoImplicitFloat)) {
+  if (!F->hasFnAttribute(Attribute::NoImplicitFloat)) {
     if (Size >= 16 &&
         (!Subtarget.isUnalignedMem16Slow() ||
          ((DstAlign == 0 || DstAlign >= 16) &&
           (SrcAlign == 0 || SrcAlign >= 16)))) {
       if (Size >= 32) {
         // FIXME: Check if unaligned 32-byte accesses are slow.
         if (Subtarget.hasInt256())
           return MVT::v8i32;
         if (Subtarget.hasFp256())
           return MVT::v8f32;
       }
       if (Subtarget.hasSSE2())
         return MVT::v4i32;
       if (Subtarget.hasSSE1())
         return MVT::v4f32;
-    } else if (!MemcpyStrSrc && Size >= 8 &&
-               !Subtarget.is64Bit() &&
-               Subtarget.hasSSE2()) {
+    } else if ((!IsMemset || ZeroMemset) && !MemcpyStrSrc && Size >= 8 &&
+               !Subtarget.is64Bit() && Subtarget.hasSSE2()) {
       // Do not use f64 to lower memcpy if source is string constant. It's
       // better to use i32 to avoid the loads.
+      // Also, do not use f64 to lower memset unless this is a memset of zeros.
+      // The gymnastics of splatting a byte value into an XMM register and then
+      // only using 8-byte stores (because this is a CPU with slow unaligned
+      // 16-byte accesses) makes that a loser.
       return MVT::f64;
     }
   }
   // This is a compromise. If we reach here, unaligned accesses may be slow on
   // this target. However, creating smaller, aligned accesses could be even
   // slower and would certainly be a lot more code.
   if (Subtarget.is64Bit() && Size >= 8)
     return MVT::i64;

llvm/trunk/test/CodeGen/X86/memset-nonzero.ll

 ; NOTE: Assertions have been autogenerated by update_test_checks.py
 ; RUN: llc -mtriple=x86_64-unknown-unknown < %s -mattr=sse2 | FileCheck %s --check-prefix=ANY --check-prefix=SSE2
 ; RUN: llc -mtriple=x86_64-unknown-unknown < %s -mattr=avx | FileCheck %s --check-prefix=ANY --check-prefix=AVX --check-prefix=AVX1
 ; RUN: llc -mtriple=x86_64-unknown-unknown < %s -mattr=avx2 | FileCheck %s --check-prefix=ANY --check-prefix=AVX --check-prefix=AVX2

 define void @memset_16_nonzero_bytes(i8* %x) {
-; ANY-LABEL: memset_16_nonzero_bytes:
-; ANY: movabsq $3038287259199220266, %rax # imm = 0x2A2A2A2A2A2A2A2A
-; ANY-NEXT: movq %rax, 8(%rdi)
-; ANY-NEXT: movq %rax, (%rdi)
-; ANY-NEXT: retq
+; SSE2-LABEL: memset_16_nonzero_bytes:
+; SSE2: movabsq $3038287259199220266, %rax # imm = 0x2A2A2A2A2A2A2A2A
+; SSE2-NEXT: movq %rax, 8(%rdi)
+; SSE2-NEXT: movq %rax, (%rdi)
+; SSE2-NEXT: retq
+;
+; AVX1-LABEL: memset_16_nonzero_bytes:
+; AVX1: vmovaps {{.*#+}} xmm0 = [707406378,707406378,707406378,707406378]
+; AVX1-NEXT: vmovups %xmm0, (%rdi)
+; AVX1-NEXT: retq
+;
+; AVX2-LABEL: memset_16_nonzero_bytes:
+; AVX2: vbroadcastss {{.*}}(%rip), %xmm0
+; AVX2-NEXT: vmovups %xmm0, (%rdi)
+; AVX2-NEXT: retq
 ;
   %call = tail call i8* @__memset_chk(i8* %x, i32 42, i64 16, i64 -1)
   ret void
 }

 define void @memset_32_nonzero_bytes(i8* %x) {
-; ANY-LABEL: memset_32_nonzero_bytes:
-; ANY: movabsq $3038287259199220266, %rax # imm = 0x2A2A2A2A2A2A2A2A
-; ANY-NEXT: movq %rax, 24(%rdi)
-; ANY-NEXT: movq %rax, 16(%rdi)
-; ANY-NEXT: movq %rax, 8(%rdi)
-; ANY-NEXT: movq %rax, (%rdi)
-; ANY-NEXT: retq
+; SSE2-LABEL: memset_32_nonzero_bytes:
+; SSE2: movabsq $3038287259199220266, %rax # imm = 0x2A2A2A2A2A2A2A2A
+; SSE2-NEXT: movq %rax, 24(%rdi)
+; SSE2-NEXT: movq %rax, 16(%rdi)
+; SSE2-NEXT: movq %rax, 8(%rdi)
+; SSE2-NEXT: movq %rax, (%rdi)
+; SSE2-NEXT: retq
+;
+; AVX1-LABEL: memset_32_nonzero_bytes:
+; AVX1: vmovaps {{.*#+}} ymm0 = [1.511366e-13,1.511366e-13,1.511366e-13,1.511366e-13,1.511366e-13,1.511366e-13,1.511366e-13,1.511366e-13]
+; AVX1-NEXT: vmovups %ymm0, (%rdi)
+; AVX1-NEXT: vzeroupper
+; AVX1-NEXT: retq
+;
+; AVX2-LABEL: memset_32_nonzero_bytes:
+; AVX2: vbroadcastss {{.*}}(%rip), %ymm0
+; AVX2-NEXT: vmovups %ymm0, (%rdi)
+; AVX2-NEXT: vzeroupper
+; AVX2-NEXT: retq
 ;
   %call = tail call i8* @__memset_chk(i8* %x, i32 42, i64 32, i64 -1)
   ret void
 }

 define void @memset_64_nonzero_bytes(i8* %x) {
-; ANY-LABEL: memset_64_nonzero_bytes:
-; ANY: movabsq $3038287259199220266, %rax # imm = 0x2A2A2A2A2A2A2A2A
-; ANY-NEXT: movq %rax, 56(%rdi)
-; ANY-NEXT: movq %rax, 48(%rdi)
-; ANY-NEXT: movq %rax, 40(%rdi)
-; ANY-NEXT: movq %rax, 32(%rdi)
-; ANY-NEXT: movq %rax, 24(%rdi)
-; ANY-NEXT: movq %rax, 16(%rdi)
-; ANY-NEXT: movq %rax, 8(%rdi)
-; ANY-NEXT: movq %rax, (%rdi)
-; ANY-NEXT: retq
+; SSE2-LABEL: memset_64_nonzero_bytes:
+; SSE2: movabsq $3038287259199220266, %rax # imm = 0x2A2A2A2A2A2A2A2A
+; SSE2-NEXT: movq %rax, 56(%rdi)
+; SSE2-NEXT: movq %rax, 48(%rdi)
+; SSE2-NEXT: movq %rax, 40(%rdi)
+; SSE2-NEXT: movq %rax, 32(%rdi)
+; SSE2-NEXT: movq %rax, 24(%rdi)
+; SSE2-NEXT: movq %rax, 16(%rdi)
+; SSE2-NEXT: movq %rax, 8(%rdi)
+; SSE2-NEXT: movq %rax, (%rdi)
+; SSE2-NEXT: retq
+;
+; AVX1-LABEL: memset_64_nonzero_bytes:
+; AVX1: vmovaps {{.*#+}} ymm0 = [1.511366e-13,1.511366e-13,1.511366e-13,1.511366e-13,1.511366e-13,1.511366e-13,1.511366e-13,1.511366e-13]
+; AVX1-NEXT: vmovups %ymm0, 32(%rdi)
+; AVX1-NEXT: vmovups %ymm0, (%rdi)
+; AVX1-NEXT: vzeroupper
+; AVX1-NEXT: retq
+;
+; AVX2-LABEL: memset_64_nonzero_bytes:
+; AVX2: vbroadcastss {{.*}}(%rip), %ymm0
+; AVX2-NEXT: vmovups %ymm0, 32(%rdi)
+; AVX2-NEXT: vmovups %ymm0, (%rdi)
+; AVX2-NEXT: vzeroupper
+; AVX2-NEXT: retq
 ;
   %call = tail call i8* @__memset_chk(i8* %x, i32 42, i64 64, i64 -1)
   ret void
 }

 define void @memset_128_nonzero_bytes(i8* %x) {
-; ANY-LABEL: memset_128_nonzero_bytes:
-; ANY: movabsq $3038287259199220266, %rax # imm = 0x2A2A2A2A2A2A2A2A
-; ANY-NEXT: movq %rax, 120(%rdi)
-; ANY-NEXT: movq %rax, 112(%rdi)
-; ANY-NEXT: movq %rax, 104(%rdi)
-; ANY-NEXT: movq %rax, 96(%rdi)
-; ANY-NEXT: movq %rax, 88(%rdi)
-; ANY-NEXT: movq %rax, 80(%rdi)
-; ANY-NEXT: movq %rax, 72(%rdi)
-; ANY-NEXT: movq %rax, 64(%rdi)
-; ANY-NEXT: movq %rax, 56(%rdi)
-; ANY-NEXT: movq %rax, 48(%rdi)
-; ANY-NEXT: movq %rax, 40(%rdi)
-; ANY-NEXT: movq %rax, 32(%rdi)
-; ANY-NEXT: movq %rax, 24(%rdi)
-; ANY-NEXT: movq %rax, 16(%rdi)
-; ANY-NEXT: movq %rax, 8(%rdi)
-; ANY-NEXT: movq %rax, (%rdi)
-; ANY-NEXT: retq
+; SSE2-LABEL: memset_128_nonzero_bytes:
+; SSE2: movabsq $3038287259199220266, %rax # imm = 0x2A2A2A2A2A2A2A2A
+; SSE2-NEXT: movq %rax, 120(%rdi)
+; SSE2-NEXT: movq %rax, 112(%rdi)
+; SSE2-NEXT: movq %rax, 104(%rdi)
+; SSE2-NEXT: movq %rax, 96(%rdi)
+; SSE2-NEXT: movq %rax, 88(%rdi)
+; SSE2-NEXT: movq %rax, 80(%rdi)
+; SSE2-NEXT: movq %rax, 72(%rdi)
+; SSE2-NEXT: movq %rax, 64(%rdi)
+; SSE2-NEXT: movq %rax, 56(%rdi)
+; SSE2-NEXT: movq %rax, 48(%rdi)
+; SSE2-NEXT: movq %rax, 40(%rdi)
+; SSE2-NEXT: movq %rax, 32(%rdi)
+; SSE2-NEXT: movq %rax, 24(%rdi)
+; SSE2-NEXT: movq %rax, 16(%rdi)
+; SSE2-NEXT: movq %rax, 8(%rdi)
+; SSE2-NEXT: movq %rax, (%rdi)
+; SSE2-NEXT: retq
+;
+; AVX1-LABEL: memset_128_nonzero_bytes:
+; AVX1: vmovaps {{.*#+}} ymm0 = [1.511366e-13,1.511366e-13,1.511366e-13,1.511366e-13,1.511366e-13,1.511366e-13,1.511366e-13,1.511366e-13]
+; AVX1-NEXT: vmovups %ymm0, 96(%rdi)
+; AVX1-NEXT: vmovups %ymm0, 64(%rdi)
+; AVX1-NEXT: vmovups %ymm0, 32(%rdi)
+; AVX1-NEXT: vmovups %ymm0, (%rdi)
+; AVX1-NEXT: vzeroupper
+; AVX1-NEXT: retq
+;
+; AVX2-LABEL: memset_128_nonzero_bytes:
+; AVX2: vbroadcastss {{.*}}(%rip), %ymm0
+; AVX2-NEXT: vmovups %ymm0, 96(%rdi)
+; AVX2-NEXT: vmovups %ymm0, 64(%rdi)
+; AVX2-NEXT: vmovups %ymm0, 32(%rdi)
+; AVX2-NEXT: vmovups %ymm0, (%rdi)
+; AVX2-NEXT: vzeroupper
+; AVX2-NEXT: retq
 ;
   %call = tail call i8* @__memset_chk(i8* %x, i32 42, i64 128, i64 -1)
   ret void
 }

 define void @memset_256_nonzero_bytes(i8* %x) {
-; ANY-LABEL: memset_256_nonzero_bytes:
-; ANY: pushq %rax
-; ANY-NEXT: .Ltmp0:
-; ANY-NEXT: .cfi_def_cfa_offset 16
-; ANY-NEXT: movl $42, %esi
-; ANY-NEXT: movl $256, %edx # imm = 0x100
-; ANY-NEXT: callq memset
-; ANY-NEXT: popq %rax
-; ANY-NEXT: retq
+; SSE2-LABEL: memset_256_nonzero_bytes:
+; SSE2: pushq %rax
+; SSE2-NEXT: .Ltmp0:
+; SSE2-NEXT: .cfi_def_cfa_offset 16
+; SSE2-NEXT: movl $42, %esi
+; SSE2-NEXT: movl $256, %edx # imm = 0x100
+; SSE2-NEXT: callq memset
+; SSE2-NEXT: popq %rax
+; SSE2-NEXT: retq
+;
+; AVX1-LABEL: memset_256_nonzero_bytes:
+; AVX1: vmovaps {{.*#+}} ymm0 = [1.511366e-13,1.511366e-13,1.511366e-13,1.511366e-13,1.511366e-13,1.511366e-13,1.511366e-13,1.511366e-13]
+; AVX1-NEXT: vmovups %ymm0, 224(%rdi)
+; AVX1-NEXT: vmovups %ymm0, 192(%rdi)
+; AVX1-NEXT: vmovups %ymm0, 160(%rdi)
+; AVX1-NEXT: vmovups %ymm0, 128(%rdi)
+; AVX1-NEXT: vmovups %ymm0, 96(%rdi)
+; AVX1-NEXT: vmovups %ymm0, 64(%rdi)
+; AVX1-NEXT: vmovups %ymm0, 32(%rdi)
+; AVX1-NEXT: vmovups %ymm0, (%rdi)
+; AVX1-NEXT: vzeroupper
+; AVX1-NEXT: retq
+;
+; AVX2-LABEL: memset_256_nonzero_bytes:
+; AVX2: vbroadcastss {{.*}}(%rip), %ymm0
+; AVX2-NEXT: vmovups %ymm0, 224(%rdi)
+; AVX2-NEXT: vmovups %ymm0, 192(%rdi)
+; AVX2-NEXT: vmovups %ymm0, 160(%rdi)
+; AVX2-NEXT: vmovups %ymm0, 128(%rdi)
+; AVX2-NEXT: vmovups %ymm0, 96(%rdi)
+; AVX2-NEXT: vmovups %ymm0, 64(%rdi)
+; AVX2-NEXT: vmovups %ymm0, 32(%rdi)
+; AVX2-NEXT: vmovups %ymm0, (%rdi)
+; AVX2-NEXT: vzeroupper
+; AVX2-NEXT: retq
 ;
   %call = tail call i8* @__memset_chk(i8* %x, i32 42, i64 256, i64 -1)
   ret void
 }

 declare i8* @__memset_chk(i8*, i32, i64, i64)
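
The constants in the check lines above all appear to encode the same splatted byte, 42 (0x2A): the 64-bit immediate in the SSE2 checks, the i32 elements 707406378 in the AVX1 16-byte case, and the float elements 1.511366e-13 in the AVX1 32-byte cases, which would be the same 32-bit pattern printed as a float. A small C++ sketch to confirm the arithmetic follows; the helper names are hypothetical, and it assumes a 32-bit IEEE-754 float.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t),
              "sketch assumes a 32-bit float");

// Replicate a byte across a 32-bit or 64-bit integer.
inline std::uint32_t splatByte32(std::uint8_t b) {
  return static_cast<std::uint32_t>(b) * 0x01010101u;
}
inline std::uint64_t splatByte64(std::uint8_t b) {
  return static_cast<std::uint64_t>(b) * 0x0101010101010101ull;
}

// Reinterpret a 32-bit pattern as a float without violating aliasing rules.
inline float bitsToFloat(std::uint32_t bits) {
  float f;
  std::memcpy(&f, &bits, sizeof f);
  return f;
}

// Compare against the literal values that appear in the check lines.
inline void checkTestConstants() {
  assert(splatByte64(42) == 3038287259199220266ull);  // 0x2A2A2A2A2A2A2A2A
  assert(splatByte32(42) == 707406378u);              // 0x2A2A2A2A
  const float f = bitsToFloat(splatByte32(42));
  assert(f > 1.5113e-13f && f < 1.5114e-13f);         // printed as 1.511366e-13
}
```

This would explain why the AVX1 constant-pool loads look different at 16 and 32 bytes even though the stored bytes are identical: the element type is v4i32 in one case and v8f32 in the other.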