This is an archive of the discontinued LLVM Phabricator instance.

[x86] use SSE/AVX ops for non-zero memsets (PR27100)
ClosedPublic

Authored by spatel on Mar 29 2016, 11:50 AM.

Details

Summary

This is a one-line-of-code patch for:
https://llvm.org/bugs/show_bug.cgi?id=27100

The reasoning that I'm hoping will hold is that we shouldn't discriminate between a memset operation and a memcpy at this level, because they have exactly the same load/store instruction type requirements.
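For illustration, here is a minimal, hypothetical example of the kind of call in question (the byte value and size below are invented, not taken from PR27100). A non-zero memset needs the same kind of wide stores as a memcpy of the same size, which is already allowed to use them:

	#include <cstring>

	// Non-zero memset: with this patch, fast SSE/AVX targets can lower this
	// with a byte splat plus wide vector stores ...
	void fill(char *dst) { std::memset(dst, 0x2a, 64); }

	// ... just like the equivalent memcpy, which is already permitted to use
	// wide vector loads/stores at this level.
	void copy(char *dst, const char *src) { std::memcpy(dst, src, 64); }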

But as the test cases show, there's some ugliness here:

  1. The i386 (Windows) test expands to use 32 stores instead of a 'rep stosl'. Is that better or worse? (I'm not sure why this change even happens yet.)
  2. The memset-2.ll tests look quite awkward in the way they splat the byte value into an XMM reg; imul isn't generally cheap.
  3. Why do the memset-nonzero.ll tests for an AVX1 target not use vbroadcast like the AVX2 target?
  4. Why does the machine scheduler reorder the DAG nodes? In all cases, we create those store nodes in low-to-high mem address order, but that's not how the machine instructions come out.

I don't think any of the above is a big enough problem to prevent this patch from going in first, but we should follow up and improve them.

Diff Detail

Repository
rL LLVM

Event Timeline

spatel updated this revision to Diff 51958.Mar 29 2016, 11:50 AM
spatel retitled this revision from to [x86] use SSE/AVX ops for non-zero memsets (PR27100).
spatel updated this object.
spatel added a subscriber: llvm-commits.
zansari edited edge metadata.Mar 29 2016, 1:11 PM
> The memset-2.ll tests look quite awkward in the way they splat the byte value into an XMM reg; imul isn't generally cheap.

This would be my biggest concern out of all the ugliness listed above.
I agree that the imul is ugly, but so is all of the other extra code generated to broadcast the byte into an xmm. I'm guessing that this is why the "zeromemset" guard was put there, specifically, to allow memsets with cheap immediates through.

It looks like the code that expands the memset is pretty inefficient. This is also what I see with a memset(m, v, 16):

movzbl	4(%esp), %ecx
movl	$16843009, %edx         # imm = 0x1010101
movl	%ecx, %eax
mull	%edx
movd	%eax, %xmm0
imull	$16843009, %ecx, %eax   # imm = 0x1010101
addl	%edx, %eax
movd	%eax, %xmm1
punpckldq	%xmm1, %xmm0    # xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
movq	%xmm0, a+8
movq	%xmm0, a

... in addition to all the gackiness, also notice that we're only doing 8-byte stores after all of that.

I like the change, but is there any chance we could fix this issue before committing? We should really only be generating a couple of shifts/ors and a shuffle, followed by full 16-byte stores.

Thanks,
Zia.

hjl.tools edited edge metadata.Mar 29 2016, 1:16 PM

For a 16-byte memset, we can generate:

	movzbl	4(%esp), %eax
	imull	$16843009, %eax, %eax
	movd	%eax, %xmm0
	movl	dst, %eax
	pshufd	$0, %xmm0, %xmm0
	movdqu	%xmm0, (%eax)
	ret

Oops. I meant

	movzbl	8(%esp), %eax
	imull	$16843009, %eax, %eax
	movd	%eax, %xmm0
	movl	4(%esp), %eax
	pshufd	$0, %xmm0, %xmm0
	movdqu	%xmm0, (%eax)
	ret

Depending on the cost of imull, you can avoid it with:

movzbl    4(%esp), %ecx
movl      %ecx, %eax
shll      $8, %eax
orl       %eax, %ecx
movl      %ecx, %edx
shll      $16, %edx
orl       %edx, %ecx
movd      %ecx, %xmm0
pshufd    $0, %xmm0, %xmm1

> ... in addition to all the gackiness, also notice that we're only doing 8-byte stores after all of that.
>
> I like the change, but is there any chance we could fix this issue before committing? We should really only be generating a couple of shifts/ors and a shuffle, followed by full 16-byte stores.

Clearly, the one-line patch was too ambitious. :)

What we're seeing in some of these changes is that we're hitting what I hope is a weird corner case: a slow unaligned SSE store implementation (i.e., before SSE4.2) with a 32-bit OS. On second thought, maybe that's not so weird.

In any case, I will fix the patch to preserve that existing behavior. By loosening the restriction on non-zero memsets only for fast CPUs, we'll avoid the strange codegen and still get the benefits shown in PR27100.

spatel updated this revision to Diff 52003.Mar 29 2016, 4:26 PM
spatel edited edge metadata.

Patch updated:
Move the memset check down to the slow SSE case: this allows fast targets to take advantage of SSE/AVX instructions and prevents slow targets from stepping into a codegen sinkhole while trying to splat a byte into an XMM reg.
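To make the intent concrete, here is a rough, self-contained sketch of the decision being described. It is illustrative only: the function, enum, and subtarget-flag names below are invented for this sketch and are not the actual X86ISelLowering.cpp code or its real types.

	#include <cstdint>

	// Sketch only -- not the real LLVM code. Types and names are invented.
	enum class StoreWidth { Scalar, Vec16, Vec32 };

	struct SubtargetFlags {
	  bool HasSSE2;
	  bool HasAVX;
	  bool SlowUnalignedMem16; // unaligned 16-byte memory ops are slow on this CPU
	};

	StoreWidth chooseMemOpWidth(uint64_t Size, bool IsMemset, bool ZeroMemset,
	                            const SubtargetFlags &ST) {
	  if (Size >= 16) {
	    if (!ST.SlowUnalignedMem16) {
	      // Fast targets: vector stores for memcpy and for all memsets,
	      // zero or non-zero -- the behavior this patch enables.
	      if (Size >= 32 && ST.HasAVX)
	        return StoreWidth::Vec32;
	      if (ST.HasSSE2)
	        return StoreWidth::Vec16;
	    } else if (!IsMemset || ZeroMemset) {
	      // Slow-unaligned targets keep the old guard: only memcpy and zero
	      // memsets get vector stores, so a non-zero memset never pays for
	      // an expensive byte splat into an XMM register.
	      if (ST.HasSSE2)
	        return StoreWidth::Vec16;
	    }
	  }
	  return StoreWidth::Scalar; // scalar stores / rep stos lowering
	}

In other words, only the placement of the (!IsMemset || ZeroMemset) condition changes: it now guards only the slow-unaligned path instead of the whole vector-store path.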

Note that, unlike the previous rev of the patch, all existing regression tests remain unchanged except for the tests that I added to model the request in PR27100.

We still have the questions of AVX1 codegen and unexpected machine scheduler behavior, but I think we can address those separately.

zansari accepted this revision.Mar 30 2016, 2:10 PM
zansari edited edge metadata.

Thanks, Sanjay. This LGTM.

This revision is now accepted and ready to land.Mar 30 2016, 2:10 PM
This revision was automatically updated to reflect the committed changes.