This is an archive of the discontinued LLVM Phabricator instance.

[X86] Detect SAD patterns and emit psadbw instructions on X86 redux
ClosedPublic

Authored by mkuper on May 24 2016, 1:52 PM.

Details

Summary

This fixes PR27539 in the SAD detection code originally commited in http://reviews.llvm.org/rL267649
The patch is intentionally against r267722 and not against trunk (in which r267649 has been reverted), to highlight the diff vs. the original patch.

The original patch wasn't happy with reductions where the reduction vector was smaller than the result of the psadbw.
This can be handled by taking the low part of the result of the psadbw - this should be safe, since when the the reduction type is narrower than the psadbw result, the input vectors are also narrow, so the non-zero part of the psadbw result is even narrower.

Diff Detail

Repository
rL LLVM

Event Timeline

mkuper updated this revision to Diff 58307.May 24 2016, 1:52 PM
mkuper retitled this revision from to [X86] Detect SAD patterns and emit psadbw instructions on X86 redux.
mkuper updated this object.
mkuper added reviewers: congh, RKSimon, hfinkel.
mkuper added subscribers: llvm-commits, davidxl, wmi.
mkuper added a reviewer: wmi.May 25 2016, 1:26 PM
wmi added inline comments.May 26 2016, 9:52 AM
lib/Target/X86/X86ISelLowering.cpp
29475 ↗(On Diff #58307)

The condition can be represented using NumConcat == 0

29475 ↗(On Diff #58307)

Can be represented as NumConcat == 0?

29476 ↗(On Diff #58307)

The EXTRACT_SUBVECTOR generated a useless pshufd for the sad_2i8 test.

test/CodeGen/X86/sad.ll
990 ↗(On Diff #58307)

The pshufd is generated because of the EXTRACT_SUBVECTOR to take the lower part of the sad result. however, since we know the sad result except the lower 16 bits are all 0, the pshufd is useless.

Thanks, Wei!

lib/Target/X86/X86ISelLowering.cpp
29475 ↗(On Diff #58307)

Right, I thought it was clearer this way. I can change it to NumConcat == 0 if you prefer. But, to be honest, I'd rather actually get rid of NumConcat in the condition above, and make that an explicit comparison as well.
What do you think?

test/CodeGen/X86/sad.ll
990 ↗(On Diff #58307)

Right, because of the way we legalize v2i32, the extract becomes an anyext instead of a nop.
I'll see if I can get rid of it. Thanks!

wmi added inline comments.May 26 2016, 11:08 AM
lib/Target/X86/X86ISelLowering.cpp
29475 ↗(On Diff #58307)

Get rid of NumConcat is better.

mkuper updated this revision to Diff 58711.May 26 2016, 4:01 PM

Got rid of NumConcat.

Regarding the shuffle - I don't see a good way to get rid of it locally. If you have an idea on how to do that, I'll be happy to hear it.
It can probably be cleaned up with a DAGCombine that recognizes the shuffle we get from the extract is a nop (because the only lanes it shuffles around are zero). Are you ok with leaving a fixme, and following up on that separately if we think it's worthwhile?

wmi edited edge metadata.May 26 2016, 5:05 PM

Regarding the shuffle - I don't see a good way to get rid of it locally. If you have an idea on how to do that, I'll be happy to hear it.

Can we use TRUNCATE instead? SadVT is 2xi64, truncate it to 2xi32.

if (VT.getSizeInBits() < ResVT.getSizeInBits()) {
  MVT ResVT = MVT::getVectorVT(MVT::i32, VT.getSizeInBits() / 32);
  Sad = DAG.getNode(ISD::TRUNCATE, DL, ResVT, SadVT);
}

That works (after a bit of massaging the types). In fact, it makes more sense than using extract_subvector. Thanks a lot!

mkuper updated this revision to Diff 58739.May 26 2016, 6:11 PM
mkuper edited edge metadata.

Eliminated redundant pshufd.

wmi added a comment.May 27 2016, 8:57 AM

LGTM. But I find that the code generated for sad_16i8/sad_32i8 are not very good because of INSERT_SUBVECTOR. Legalization of INSERT_SUBVECTOR introduces many useless stack saves/restores. We may need to fix them in a separate patch before recommit r267649 and this patch.

I agree the spill code is complete nonsense, but I don't think it should block this patch, for two reasons:

  1. We seem to have a real problem with inserts into oversize vectors (e.g. <16 x i32> on SSE2), but I'm not entirely sure it's a high priority. I'm not sure how much of those the vectorizer actually generates - although, for reductions, that may be more common, I haven't checked.
  1. More importantly, even with the spills, it still looks like a large net improvement. E.g. the SSE2 code for the sad16_i8 loop before this patch is:
.LBB0_1:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
	movdqu	a+1024(%rax), %xmm7
	pshufd	$78, %xmm7, %xmm5       # xmm5 = xmm7[2,3,0,1]
	punpcklbw	%xmm10, %xmm5   # xmm5 = xmm5[0],xmm10[0],xmm5[1],xmm10[1],xmm5[2],xmm10[2],xmm5[3],xmm10[3],xmm5[4],xmm10[4],xmm5[5],xmm10[5],xmm5[6],xmm10[6],xmm5[7],xmm10[7]
	movdqa	%xmm5, %xmm6
	punpcklwd	%xmm10, %xmm6   # xmm6 = xmm6[0],xmm10[0],xmm6[1],xmm10[1],xmm6[2],xmm10[2],xmm6[3],xmm10[3]
	punpckhwd	%xmm10, %xmm5   # xmm5 = xmm5[4],xmm10[4],xmm5[5],xmm10[5],xmm5[6],xmm10[6],xmm5[7],xmm10[7]
	punpcklbw	%xmm10, %xmm7   # xmm7 = xmm7[0],xmm10[0],xmm7[1],xmm10[1],xmm7[2],xmm10[2],xmm7[3],xmm10[3],xmm7[4],xmm10[4],xmm7[5],xmm10[5],xmm7[6],xmm10[6],xmm7[7],xmm10[7]
	movdqa	%xmm7, %xmm0
	punpcklwd	%xmm10, %xmm0   # xmm0 = xmm0[0],xmm10[0],xmm0[1],xmm10[1],xmm0[2],xmm10[2],xmm0[3],xmm10[3]
	punpckhwd	%xmm10, %xmm7   # xmm7 = xmm7[4],xmm10[4],xmm7[5],xmm10[5],xmm7[6],xmm10[6],xmm7[7],xmm10[7]
	movdqu	b+1024(%rax), %xmm1
	pshufd	$78, %xmm1, %xmm2       # xmm2 = xmm1[2,3,0,1]
	punpcklbw	%xmm10, %xmm2   # xmm2 = xmm2[0],xmm10[0],xmm2[1],xmm10[1],xmm2[2],xmm10[2],xmm2[3],xmm10[3],xmm2[4],xmm10[4],xmm2[5],xmm10[5],xmm2[6],xmm10[6],xmm2[7],xmm10[7]
	movdqa	%xmm2, %xmm3
	punpcklwd	%xmm10, %xmm3   # xmm3 = xmm3[0],xmm10[0],xmm3[1],xmm10[1],xmm3[2],xmm10[2],xmm3[3],xmm10[3]
	punpckhwd	%xmm10, %xmm2   # xmm2 = xmm2[4],xmm10[4],xmm2[5],xmm10[5],xmm2[6],xmm10[6],xmm2[7],xmm10[7]
	punpcklbw	%xmm10, %xmm1   # xmm1 = xmm1[0],xmm10[0],xmm1[1],xmm10[1],xmm1[2],xmm10[2],xmm1[3],xmm10[3],xmm1[4],xmm10[4],xmm1[5],xmm10[5],xmm1[6],xmm10[6],xmm1[7],xmm10[7]
	movdqa	%xmm1, %xmm4
	punpcklwd	%xmm10, %xmm4   # xmm4 = xmm4[0],xmm10[0],xmm4[1],xmm10[1],xmm4[2],xmm10[2],xmm4[3],xmm10[3]
	punpckhwd	%xmm10, %xmm1   # xmm1 = xmm1[4],xmm10[4],xmm1[5],xmm10[5],xmm1[6],xmm10[6],xmm1[7],xmm10[7]
	psubd	%xmm1, %xmm7
	psubd	%xmm4, %xmm0
	psubd	%xmm2, %xmm5
	psubd	%xmm3, %xmm6
	movdqa	%xmm6, %xmm1
	psrad	$31, %xmm1
	paddd	%xmm1, %xmm6
	pxor	%xmm1, %xmm6
	movdqa	%xmm5, %xmm1
	psrad	$31, %xmm1
	paddd	%xmm1, %xmm5
	pxor	%xmm1, %xmm5
	movdqa	%xmm0, %xmm1
	psrad	$31, %xmm1
	paddd	%xmm1, %xmm0
	pxor	%xmm1, %xmm0
	movdqa	%xmm7, %xmm1
	psrad	$31, %xmm1
	paddd	%xmm1, %xmm7
	pxor	%xmm1, %xmm7
	paddd	%xmm7, %xmm8
	paddd	%xmm0, %xmm9
	paddd	%xmm5, %xmm11
	paddd	%xmm6, %xmm12
	addq	$4, %rax
	jne	.LBB0_1

And with this patch:

.LBB0_1:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
	movdqa	%xmm0, %xmm4
	movdqu	a+1024(%rax), %xmm5
	movdqu	b+1024(%rax), %xmm0
	movdqa	%xmm4, (%rsp)
	movdqa	%xmm1, 16(%rsp)
	movdqa	%xmm3, 32(%rsp)
	movdqa	%xmm2, 48(%rsp)
	psadbw	%xmm5, %xmm0
	paddd	%xmm4, %xmm0
	movdqa	%xmm0, (%rsp)
	movdqa	16(%rsp), %xmm1
	movdqa	32(%rsp), %xmm3
	movdqa	48(%rsp), %xmm2
	addq	$4, %rax

Are you ok with me committing as is, and filing a PR on the insert_subvector issue?

wmi accepted this revision.May 27 2016, 9:41 AM
wmi edited edge metadata.
This revision is now accepted and ready to land.May 27 2016, 9:41 AM
This revision was automatically updated to reflect the committed changes.