This is an archive of the discontinued LLVM Phabricator instance.

[x86] allow single source horizontal op matching (PR39195)
ClosedPublic

Authored by spatel on Oct 8 2018, 1:03 PM.

Details

Reviewers

craig.topper
RKSimon
dyung
andreadb

Commits

rG6cca8af2270b: [x86] allow single source horizontal op matching (PR39195)
rL344141: [x86] allow single source horizontal op matching (PR39195)

Summary

This is intended to restore horizontal codegen to what it looked like before IR demanded elements improved in:
rL343727

As noted in PR39195:
https://bugs.llvm.org/show_bug.cgi?id=39195
...horizontal ops can be worse for performance than a shuffle+regular binop, so I've added a TODO. Ideally, we'd solve that in a machine instruction pass, but a quicker solution may be adding a 'HasSlowHorizontalOp' feature/bug bit to deal with it here in the DAG.
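
To make the trade-off concrete, here is a minimal scalar sketch (not LLVM code) of the two sequences for a <2 x double> input. The helper names `haddpd`, `shuffleThenAdd` and `checkEquivalence` are made up for illustration. They model the lane arithmetic only and say nothing about latency or throughput.

```cpp
#include <array>
#include <cassert>

using V2 = std::array<double, 2>;

// Scalar model of HADDPD: lane 0 sums the pair in `a`, lane 1 the pair in `b`.
V2 haddpd(const V2 &a, const V2 &b) { return {a[0] + a[1], b[0] + b[1]}; }

// Scalar model of the alternative: swap the lanes of `x`, then add vertically.
V2 shuffleThenAdd(const V2 &x) {
  const V2 swapped = {x[1], x[0]};
  return {swapped[0] + x[0], swapped[1] + x[1]};
}

// Checks that the single-source horizontal add and shuffle+add agree per lane.
void checkEquivalence(const V2 &x) {
  const V2 h = haddpd(x, x);
  const V2 s = shuffleThenAdd(x);
  assert(h[0] == s[0] && h[1] == s[1]);
}
```

Both forms compute the same lanes when the two HADD operands are the same vector, so the choice between them is purely a cost question.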

Diff Detail

Repository: rL LLVM

Event Timeline

spatel created this revision. Oct 8 2018, 1:03 PM

Herald added a subscriber: mcrosier. Oct 8 2018, 1:03 PM

RKSimon added a reviewer: andreadb. Oct 8 2018, 2:42 PM

Hi Sanjay,

On Jaguar only, those unary HADD are going to be as fast as the SHUFFLE+ADD sequence.
In terms of overall latency, both sequences are pretty much equivalent.

HADD has a worse throughput than the SHUFFLE+ADD sequence (1 IPC) mainly because it can only execute on pipe0. SHUFFLE+ADD gives more flexibility to the HW scheduler.
The biggest advantage on Jaguar is that HADD is not microcoded. The XMM variant is fast-path single, which allows us to achieve a better throughput from the decoders (w.r.t. the SHUFFLE+ADD).

I don't see it as a big problem if we start "regressing" this particular case on Jaguar.

I don't have a problem with aggressively selecting HADD at ISel stage, provided that we "undo" that canonicalization in a later (machine combiner?) pass.
Using HADD is not just slow for Intel, it is going to be slow for other AMD processors too. Similarly to what we do for other instructions (CMOV/LEA) which may be further expanded later on.

The problem with having a rule in the machine combiner is that we need to account for register pressure and block frequency too. Essentially, we need a (not too trivial) cost model there; simply comparing code snippets in terms of throughput and latency is probably not enough at that stage.
We could have a post-RA pass (before we run the post-RA scheduler) that decides when it is profitable to revert the HADD canonicalization and expand it back to a shuffle+add.

Just my opinion.

In D52997#1258666, @andreadb wrote:

I don't see it as a big problem if we start "regressing" this particular case on Jaguar.

Ok - that's not the impression I got reading PR39195. @dyung - how important is this pattern?

I don't have a problem with aggressively selecting HADD at ISel stage, provided that we "undo" that canonicalization in a later (machine combiner?) pass.
Using HADD is not just slow for Intel, it is going to be slow for other AMD processors too. Similarly to what we do for other instructions (CMOV/LEA) which may be further expanded later on.

Yes, I agree. I even filed a bug. :)
https://bugs.llvm.org/show_bug.cgi?id=26859

So 2 options for moving forward:

1. Allow this transform as shown here because it is mostly just restoring the behavior of last week. Follow that up with a subtarget feature to prevent the transform (not ideal, but the alternative 'undo' is much harder).
2. Limit this transform to 'optsize' right now because it's a size win in all cases.

In D52997#1258717, @spatel wrote:

So 2 options for moving forward:

1. Allow this transform as shown here because it is mostly just restoring the behavior of last week. Follow that up with a subtarget feature to prevent the transform (not ideal, but the alternative 'undo' is much harder).

2. Limit this transform to 'optsize' right now because it's a size win in all cases.

I'd vote for (1) for this patch - optsize + HasFastHorizontalOp might be necessary depending on how soon we can agree on a scheduler-model-driven mechanism that re-expands HADD later on (per Andrea's suggestion - but hopefully we can discuss that at the devmtg).
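
The gating idea discussed here (size optimization plus a fast-horizontal-op subtarget feature) could look roughly like the predicate below. This is only a sketch: `HorizontalOpContext` and `shouldFormHorizontalOp` are hypothetical names, not LLVM APIs, and leaving two-source matching unconditional is an assumption, since this revision only concerns the single-source case.

```cpp
#include <cassert>

// Hypothetical inputs for illustration; none of these are LLVM APIs.
struct HorizontalOpContext {
  bool isSingleSource;       // both h-op operands are the same vector
  bool optForSize;           // the function is being optimized for size
  bool hasFastHorizontalOps; // stand-in for the proposed subtarget feature
};

// Returns true if a horizontal op should be formed under the policy sketched
// in the thread: single-source h-ops only when they are a size win that the
// user asked for, or when the target reports them as fast.
bool shouldFormHorizontalOp(const HorizontalOpContext &ctx) {
  if (!ctx.isSingleSource)
    return true; // assumption: two-source matching is left unchanged
  return ctx.optForSize || ctx.hasFastHorizontalOps;
}
```

With such a predicate, a Jaguar-like target would set the feature bit and keep the single-source h-ops, while other targets would get them only under optsize.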

In D52997#1258720, @RKSimon wrote:

I'd vote for (1) for this patch - optsize + HasFastHorizontalOp might be necessary depending on how soon we can agree on a scheduler-model-driven mechanism that re-expands HADD later on (per Andrea's suggestion - but hopefully we can discuss that at the devmtg).

Same.
I am okay with (1) for now.

LGTM - thanks

This revision is now accepted and ready to land. Oct 9 2018, 10:04 AM

In D52997#1258717, @spatel wrote:

In D52997#1258666, @andreadb wrote:

I don't see it as a big problem if we start "regressing" this particular case on Jaguar.

Ok - that's not the impression I got reading PR39195. @dyung - how important is this pattern?

I will defer to Andrea's expertise in this area. I was under the impression that generating the horizontal add/sub instructions was preferred for these cases (especially for btver2), but if that is not the case, I can update our test to not expect it to be generated.

In D52997#1259162, @dyung wrote:

I will defer to Andrea's expertise in this area. I was under the impression that generating the horizontal add/sub instructions was preferred for these cases (especially for btver2), but if that is not the case, I can update our test to not expect it to be generated.

Thanks. I think we're all in agreement: we have 2 wrongs making it right (fewer h-ops) for most CPUs (but not Jaguar) currently. This patch will remove 1 of those wrongs, but there's little regression potential because it's just restoring what happened before (h-ops are unusually good for Jaguar). I should have the follow-up patch to try to make every CPU happy posted soon after this is committed.

Closed by commit rL344141: [x86] allow single source horizontal op matching (PR39195) (authored by spatel). Oct 10 2018, 6:43 AM

This revision was automatically updated to reflect the committed changes.

spatel mentioned this in D53095: [x86] add and use fast horizontal vector math subtarget feature. Oct 10 2018, 10:27 AM

spatel mentioned this in rL344361: [x86] add and use fast horizontal vector math subtarget feature. Oct 12 2018, 9:43 AM

Revision Contents

Path                                                         Size
llvm/trunk/lib/Target/X86/X86ISelLowering.cpp                8 lines
llvm/trunk/test/CodeGen/X86/avx512-intrinsics-fast-isel.ll   12 lines
llvm/trunk/test/CodeGen/X86/haddsub-undef.ll                 71 lines
llvm/trunk/test/CodeGen/X86/phaddsub.ll                      96 lines
llvm/trunk/test/CodeGen/X86/vector-shuffle-combining.ll      39 lines

Diff 168998

llvm/trunk/lib/Target/X86/X86ISelLowering.cpp

   for (unsigned i = 0; i != NumEltsPer128BitChunk; ++i) {
     // Ignore undefined components.
     int LIdx = LMask[i + j], RIdx = RMask[i + j];
     if (LIdx < 0 || RIdx < 0 ||
         (!A.getNode() && (LIdx < (int)NumElts || RIdx < (int)NumElts)) ||
         (!B.getNode() && (LIdx >= (int)NumElts || RIdx >= (int)NumElts)))
       continue;
 
     // The low half of the 128-bit result must choose from A.
-    // The high half of the 128-bit result must choose from B.
+    // The high half of the 128-bit result must choose from B,
+    // unless B is undef. In that case, we are always choosing from A.
+    // TODO: Using a horizontal op on a single input is likely worse for
+    // performance on many CPUs, so this should be limited here or reversed
+    // in a later pass.
     unsigned NumEltsPer64BitChunk = NumEltsPer128BitChunk / 2;
-    unsigned Src = i >= NumEltsPer64BitChunk;
+    unsigned Src = B.getNode() ? i >= NumEltsPer64BitChunk : 0;
 
     // Check that successive elements are being operated on. If not, this is
     // not a horizontal operation.
     int Index = 2 * (i % NumEltsPer64BitChunk) + NumElts * Src + j;
     if (!(LIdx == Index && RIdx == Index + 1) &&
         !(IsCommutative && LIdx == Index + 1 && RIdx == Index))
       return false;
   }
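
As a rough standalone illustration of the index check in the X86ISelLowering.cpp hunk above, the sketch below models it for a single 128-bit vector. It is a simplification under stated assumptions, not the LLVM implementation: `matchesHorizontalMask` and the `AllowSingleSource` flag are hypothetical, the masks are plain `std::vector<int>` (with -1 for undef, values below NumElts selecting from A, the rest from B), and the outer loop over 128-bit chunks (`j`) is omitted.

```cpp
#include <cassert>
#include <vector>

// Simplified, hypothetical model of the horizontal-op mask check for one
// 128-bit vector. AllowSingleSource = false mimics the old Src computation;
// AllowSingleSource = true mimics the patched one.
bool matchesHorizontalMask(const std::vector<int> &LMask,
                           const std::vector<int> &RMask, bool HasA,
                           bool HasB, bool IsCommutative,
                           bool AllowSingleSource) {
  assert(LMask.size() == RMask.size() && LMask.size() % 2 == 0 &&
         !LMask.empty());
  const int NumElts = static_cast<int>(LMask.size());
  const int Half = NumElts / 2;
  for (int i = 0; i != NumElts; ++i) {
    const int LIdx = LMask[i], RIdx = RMask[i];
    // Ignore undefined components and components of a missing operand.
    if (LIdx < 0 || RIdx < 0 ||
        (!HasA && (LIdx < NumElts || RIdx < NumElts)) ||
        (!HasB && (LIdx >= NumElts || RIdx >= NumElts)))
      continue;
    // The high half chooses from B, unless B is absent and the single-source
    // case is allowed; then every lane chooses from A.
    const int Src = (HasB || !AllowSingleSource) ? (i >= Half) : 0;
    // Successive elements must be paired up, in either order if commutative.
    const int Index = 2 * (i % Half) + NumElts * Src;
    if (!(LIdx == Index && RIdx == Index + 1) &&
        !(IsCommutative && LIdx == Index + 1 && RIdx == Index))
      return false;
  }
  return true;
}
```

With the masks from phaddd_single_source1 in phaddsub.ll, `<undef, undef, 0, 2>` and `<undef, undef, 1, 3>`, this model rejects the pattern under the old rule and accepts it once the high half may also read from A.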

llvm/trunk/test/CodeGen/X86/avx512-intrinsics-fast-isel.ll

	; X86-NEXT: movl %esp, %ebp			; X86-NEXT: movl %esp, %ebp
	; X86-NEXT: .cfi_def_cfa_register %ebp			; X86-NEXT: .cfi_def_cfa_register %ebp
	; X86-NEXT: andl $-8, %esp			; X86-NEXT: andl $-8, %esp
	; X86-NEXT: subl $8, %esp			; X86-NEXT: subl $8, %esp
	; X86-NEXT: vextractf64x4 $1, %zmm0, %ymm1			; X86-NEXT: vextractf64x4 $1, %zmm0, %ymm1
	; X86-NEXT: vaddpd %ymm1, %ymm0, %ymm0			; X86-NEXT: vaddpd %ymm1, %ymm0, %ymm0
	; X86-NEXT: vextractf128 $1, %ymm0, %xmm1			; X86-NEXT: vextractf128 $1, %ymm0, %xmm1
	; X86-NEXT: vaddpd %xmm1, %xmm0, %xmm0			; X86-NEXT: vaddpd %xmm1, %xmm0, %xmm0
	; X86-NEXT: vpermilpd {{.*#+}} xmm1 = xmm0[1,0]			; X86-NEXT: vhaddpd %xmm0, %xmm0, %xmm0
	; X86-NEXT: vaddpd %xmm1, %xmm0, %xmm0
	; X86-NEXT: vmovlpd %xmm0, (%esp)			; X86-NEXT: vmovlpd %xmm0, (%esp)
	; X86-NEXT: fldl (%esp)			; X86-NEXT: fldl (%esp)
	; X86-NEXT: movl %ebp, %esp			; X86-NEXT: movl %ebp, %esp
	; X86-NEXT: popl %ebp			; X86-NEXT: popl %ebp
	; X86-NEXT: .cfi_def_cfa %esp, 4			; X86-NEXT: .cfi_def_cfa %esp, 4
	; X86-NEXT: vzeroupper			; X86-NEXT: vzeroupper
	; X86-NEXT: retl			; X86-NEXT: retl
	;			;
	; X64-LABEL: test_mm512_reduce_add_pd:			; X64-LABEL: test_mm512_reduce_add_pd:
	; X64: # %bb.0: # %entry			; X64: # %bb.0: # %entry
	; X64-NEXT: vextractf64x4 $1, %zmm0, %ymm1			; X64-NEXT: vextractf64x4 $1, %zmm0, %ymm1
	; X64-NEXT: vaddpd %ymm1, %ymm0, %ymm0			; X64-NEXT: vaddpd %ymm1, %ymm0, %ymm0
	; X64-NEXT: vextractf128 $1, %ymm0, %xmm1			; X64-NEXT: vextractf128 $1, %ymm0, %xmm1
	; X64-NEXT: vaddpd %xmm1, %xmm0, %xmm0			; X64-NEXT: vaddpd %xmm1, %xmm0, %xmm0
	; X64-NEXT: vpermilpd {{.*#+}} xmm1 = xmm0[1,0]			; X64-NEXT: vhaddpd %xmm0, %xmm0, %xmm0
	; X64-NEXT: vaddpd %xmm1, %xmm0, %xmm0
	; X64-NEXT: vzeroupper			; X64-NEXT: vzeroupper
	; X64-NEXT: retq			; X64-NEXT: retq
	entry:			entry:
	%shuffle.i = shufflevector <8 x double> %__W, <8 x double> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>			%shuffle.i = shufflevector <8 x double> %__W, <8 x double> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
	%shuffle1.i = shufflevector <8 x double> %__W, <8 x double> undef, <4 x i32> <i32 4, i32 5, i32 6, i32 7>			%shuffle1.i = shufflevector <8 x double> %__W, <8 x double> undef, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
	%add.i = fadd <4 x double> %shuffle.i, %shuffle1.i			%add.i = fadd <4 x double> %shuffle.i, %shuffle1.i
	%shuffle2.i = shufflevector <4 x double> %add.i, <4 x double> undef, <2 x i32> <i32 0, i32 1>			%shuffle2.i = shufflevector <4 x double> %add.i, <4 x double> undef, <2 x i32> <i32 0, i32 1>
	%shuffle3.i = shufflevector <4 x double> %add.i, <4 x double> undef, <2 x i32> <i32 2, i32 3>			%shuffle3.i = shufflevector <4 x double> %add.i, <4 x double> undef, <2 x i32> <i32 2, i32 3>
	[163 lines not shown]
	; X86-NEXT: subl $8, %esp			; X86-NEXT: subl $8, %esp
	; X86-NEXT: movb 8(%ebp), %al			; X86-NEXT: movb 8(%ebp), %al
	; X86-NEXT: kmovw %eax, %k1			; X86-NEXT: kmovw %eax, %k1
	; X86-NEXT: vmovapd %zmm0, %zmm0 {%k1} {z}			; X86-NEXT: vmovapd %zmm0, %zmm0 {%k1} {z}
	; X86-NEXT: vextractf64x4 $1, %zmm0, %ymm1			; X86-NEXT: vextractf64x4 $1, %zmm0, %ymm1
	; X86-NEXT: vaddpd %ymm1, %ymm0, %ymm0			; X86-NEXT: vaddpd %ymm1, %ymm0, %ymm0
	; X86-NEXT: vextractf128 $1, %ymm0, %xmm1			; X86-NEXT: vextractf128 $1, %ymm0, %xmm1
	; X86-NEXT: vaddpd %xmm1, %xmm0, %xmm0			; X86-NEXT: vaddpd %xmm1, %xmm0, %xmm0
	; X86-NEXT: vpermilpd {{.*#+}} xmm1 = xmm0[1,0]			; X86-NEXT: vhaddpd %xmm0, %xmm0, %xmm0
	; X86-NEXT: vaddpd %xmm1, %xmm0, %xmm0
	; X86-NEXT: vmovlpd %xmm0, (%esp)			; X86-NEXT: vmovlpd %xmm0, (%esp)
	; X86-NEXT: fldl (%esp)			; X86-NEXT: fldl (%esp)
	; X86-NEXT: movl %ebp, %esp			; X86-NEXT: movl %ebp, %esp
	; X86-NEXT: popl %ebp			; X86-NEXT: popl %ebp
	; X86-NEXT: .cfi_def_cfa %esp, 4			; X86-NEXT: .cfi_def_cfa %esp, 4
	; X86-NEXT: vzeroupper			; X86-NEXT: vzeroupper
	; X86-NEXT: retl			; X86-NEXT: retl
	;			;
	; X64-LABEL: test_mm512_mask_reduce_add_pd:			; X64-LABEL: test_mm512_mask_reduce_add_pd:
	; X64: # %bb.0: # %entry			; X64: # %bb.0: # %entry
	; X64-NEXT: kmovw %edi, %k1			; X64-NEXT: kmovw %edi, %k1
	; X64-NEXT: vmovapd %zmm0, %zmm0 {%k1} {z}			; X64-NEXT: vmovapd %zmm0, %zmm0 {%k1} {z}
	; X64-NEXT: vextractf64x4 $1, %zmm0, %ymm1			; X64-NEXT: vextractf64x4 $1, %zmm0, %ymm1
	; X64-NEXT: vaddpd %ymm1, %ymm0, %ymm0			; X64-NEXT: vaddpd %ymm1, %ymm0, %ymm0
	; X64-NEXT: vextractf128 $1, %ymm0, %xmm1			; X64-NEXT: vextractf128 $1, %ymm0, %xmm1
	; X64-NEXT: vaddpd %xmm1, %xmm0, %xmm0			; X64-NEXT: vaddpd %xmm1, %xmm0, %xmm0
	; X64-NEXT: vpermilpd {{.*#+}} xmm1 = xmm0[1,0]			; X64-NEXT: vhaddpd %xmm0, %xmm0, %xmm0
	; X64-NEXT: vaddpd %xmm1, %xmm0, %xmm0
	; X64-NEXT: vzeroupper			; X64-NEXT: vzeroupper
	; X64-NEXT: retq			; X64-NEXT: retq
	entry:			entry:
	%0 = bitcast i8 %__M to <8 x i1>			%0 = bitcast i8 %__M to <8 x i1>
	%1 = select <8 x i1> %0, <8 x double> %__W, <8 x double> zeroinitializer			%1 = select <8 x i1> %0, <8 x double> %__W, <8 x double> zeroinitializer
	%shuffle.i = shufflevector <8 x double> %1, <8 x double> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>			%shuffle.i = shufflevector <8 x double> %1, <8 x double> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
	%shuffle1.i = shufflevector <8 x double> %1, <8 x double> undef, <4 x i32> <i32 4, i32 5, i32 6, i32 7>			%shuffle1.i = shufflevector <8 x double> %1, <8 x double> undef, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
	%add.i = fadd <4 x double> %shuffle.i, %shuffle1.i			%add.i = fadd <4 x double> %shuffle.i, %shuffle1.i

llvm/trunk/test/CodeGen/X86/haddsub-undef.ll

; AVX2-NEXT: retq
%add4 = add i32 %vecext6, %vecext7		%add4 = add i32 %vecext6, %vecext7
%vecinit4 = insertelement <8 x i32> %vecinit3, i32 %add4, i32 3		%vecinit4 = insertelement <8 x i32> %vecinit3, i32 %add4, i32 3
ret <8 x i32> %vecinit4		ret <8 x i32> %vecinit4
}		}

define <2 x double> @add_pd_003(<2 x double> %x) {		define <2 x double> @add_pd_003(<2 x double> %x) {
; SSE-LABEL: add_pd_003:		; SSE-LABEL: add_pd_003:
; SSE: # %bb.0:		; SSE: # %bb.0:
; SSE-NEXT: movddup {{.*#+}} xmm1 = xmm0[0,0]		; SSE-NEXT: haddpd %xmm0, %xmm0
; SSE-NEXT: addpd %xmm1, %xmm0
; SSE-NEXT: retq		; SSE-NEXT: retq
;		;
; AVX-LABEL: add_pd_003:		; AVX-LABEL: add_pd_003:
; AVX: # %bb.0:		; AVX: # %bb.0:
; AVX-NEXT: vmovddup {{.*#+}} xmm1 = xmm0[0,0]		; AVX-NEXT: vhaddpd %xmm0, %xmm0, %xmm0
; AVX-NEXT: vaddpd %xmm0, %xmm1, %xmm0
; AVX-NEXT: retq		; AVX-NEXT: retq
%l = shufflevector <2 x double> %x, <2 x double> undef, <2 x i32> <i32 undef, i32 0>		%l = shufflevector <2 x double> %x, <2 x double> undef, <2 x i32> <i32 undef, i32 0>
%add = fadd <2 x double> %l, %x		%add = fadd <2 x double> %l, %x
ret <2 x double> %add		ret <2 x double> %add
}		}

; Change shuffle mask - no undefs.		; Change shuffle mask - no undefs.

define <2 x double> @add_pd_003_2(<2 x double> %x) {		define <2 x double> @add_pd_003_2(<2 x double> %x) {
; SSE-LABEL: add_pd_003_2:		; SSE-LABEL: add_pd_003_2:
; SSE: # %bb.0:		; SSE: # %bb.0:
; SSE-NEXT: movapd %xmm0, %xmm1		; SSE-NEXT: haddpd %xmm0, %xmm0
; SSE-NEXT: shufpd {{.*#+}} xmm1 = xmm1[1],xmm0[0]
; SSE-NEXT: addpd %xmm0, %xmm1
; SSE-NEXT: movapd %xmm1, %xmm0
; SSE-NEXT: retq		; SSE-NEXT: retq
;		;
; AVX-LABEL: add_pd_003_2:		; AVX-LABEL: add_pd_003_2:
; AVX: # %bb.0:		; AVX: # %bb.0:
; AVX-NEXT: vpermilpd {{.*#+}} xmm1 = xmm0[1,0]		; AVX-NEXT: vhaddpd %xmm0, %xmm0, %xmm0
; AVX-NEXT: vaddpd %xmm0, %xmm1, %xmm0
; AVX-NEXT: retq		; AVX-NEXT: retq
%l = shufflevector <2 x double> %x, <2 x double> undef, <2 x i32> <i32 1, i32 0>		%l = shufflevector <2 x double> %x, <2 x double> undef, <2 x i32> <i32 1, i32 0>
%add = fadd <2 x double> %l, %x		%add = fadd <2 x double> %l, %x
ret <2 x double> %add		ret <2 x double> %add
}		}

define <2 x double> @add_pd_010(<2 x double> %x) {		define <2 x double> @add_pd_010(<2 x double> %x) {
; SSE-LABEL: add_pd_010:		; SSE-LABEL: add_pd_010:
; SSE: # %bb.0:		; SSE: # %bb.0:
; SSE-NEXT: movddup {{.*#+}} xmm1 = xmm0[0,0]		; SSE-NEXT: haddpd %xmm0, %xmm0
; SSE-NEXT: addpd %xmm0, %xmm1
; SSE-NEXT: unpckhpd {{.*#+}} xmm1 = xmm1[1,1]
; SSE-NEXT: movapd %xmm1, %xmm0
; SSE-NEXT: retq		; SSE-NEXT: retq
;		;
; AVX-LABEL: add_pd_010:		; AVX-LABEL: add_pd_010:
; AVX: # %bb.0:		; AVX: # %bb.0:
; AVX-NEXT: vmovddup {{.*#+}} xmm1 = xmm0[0,0]		; AVX-NEXT: vhaddpd %xmm0, %xmm0, %xmm0
; AVX-NEXT: vaddpd %xmm0, %xmm1, %xmm0
; AVX-NEXT: vpermilpd {{.*#+}} xmm0 = xmm0[1,0]		; AVX-NEXT: vpermilpd {{.*#+}} xmm0 = xmm0[1,0]
; AVX-NEXT: retq		; AVX-NEXT: retq
%l = shufflevector <2 x double> %x, <2 x double> undef, <2 x i32> <i32 undef, i32 0>		%l = shufflevector <2 x double> %x, <2 x double> undef, <2 x i32> <i32 undef, i32 0>
%add = fadd <2 x double> %l, %x		%add = fadd <2 x double> %l, %x
%shuffle2 = shufflevector <2 x double> %add, <2 x double> undef, <2 x i32> <i32 1, i32 undef>		%shuffle2 = shufflevector <2 x double> %add, <2 x double> undef, <2 x i32> <i32 1, i32 undef>
ret <2 x double> %shuffle2		ret <2 x double> %shuffle2
}		}

define <4 x float> @add_ps_007(<4 x float> %x) {		define <4 x float> @add_ps_007(<4 x float> %x) {
; SSE-LABEL: add_ps_007:		; SSE-LABEL: add_ps_007:
; SSE: # %bb.0:		; SSE: # %bb.0:
; SSE-NEXT: movaps %xmm0, %xmm1		; SSE-NEXT: haddps %xmm0, %xmm0
; SSE-NEXT: shufps {{.*#+}} xmm1 = xmm1[0,1],xmm0[0,2]
; SSE-NEXT: shufps {{.*#+}} xmm0 = xmm0[0,1,1,3]
; SSE-NEXT: addps %xmm1, %xmm0
; SSE-NEXT: retq		; SSE-NEXT: retq
;		;
; AVX-LABEL: add_ps_007:		; AVX-LABEL: add_ps_007:
; AVX: # %bb.0:		; AVX: # %bb.0:
; AVX-NEXT: vpermilps {{.*#+}} xmm1 = xmm0[0,1,0,2]		; AVX-NEXT: vhaddps %xmm0, %xmm0, %xmm0
; AVX-NEXT: vpermilps {{.*#+}} xmm0 = xmm0[0,1,1,3]
; AVX-NEXT: vaddps %xmm0, %xmm1, %xmm0
; AVX-NEXT: retq		; AVX-NEXT: retq
%l = shufflevector <4 x float> %x, <4 x float> undef, <4 x i32> <i32 undef, i32 undef, i32 0, i32 2>		%l = shufflevector <4 x float> %x, <4 x float> undef, <4 x i32> <i32 undef, i32 undef, i32 0, i32 2>
%r = shufflevector <4 x float> %x, <4 x float> undef, <4 x i32> <i32 undef, i32 undef, i32 1, i32 3>		%r = shufflevector <4 x float> %x, <4 x float> undef, <4 x i32> <i32 undef, i32 undef, i32 1, i32 3>
%add = fadd <4 x float> %l, %r		%add = fadd <4 x float> %l, %r
ret <4 x float> %add		ret <4 x float> %add
}		}

define <4 x float> @add_ps_030(<4 x float> %x) {		define <4 x float> @add_ps_030(<4 x float> %x) {
; SSE-LABEL: add_ps_030:		; SSE-LABEL: add_ps_030:
; SSE: # %bb.0:		; SSE: # %bb.0:
; SSE-NEXT: movaps %xmm0, %xmm1		; SSE-NEXT: haddps %xmm0, %xmm0
; SSE-NEXT: shufps {{.*#+}} xmm1 = xmm1[0,1],xmm0[0,2]
; SSE-NEXT: shufps {{.*#+}} xmm0 = xmm0[0,1,1,3]
; SSE-NEXT: addps %xmm1, %xmm0
; SSE-NEXT: shufps {{.*#+}} xmm0 = xmm0[3,2,2,3]		; SSE-NEXT: shufps {{.*#+}} xmm0 = xmm0[3,2,2,3]
; SSE-NEXT: retq		; SSE-NEXT: retq
;		;
; AVX-LABEL: add_ps_030:		; AVX-LABEL: add_ps_030:
; AVX: # %bb.0:		; AVX: # %bb.0:
; AVX-NEXT: vpermilps {{.*#+}} xmm1 = xmm0[0,1,0,2]		; AVX-NEXT: vhaddps %xmm0, %xmm0, %xmm0
; AVX-NEXT: vpermilps {{.*#+}} xmm0 = xmm0[0,1,1,3]
; AVX-NEXT: vaddps %xmm0, %xmm1, %xmm0
; AVX-NEXT: vpermilps {{.*#+}} xmm0 = xmm0[3,2,2,3]		; AVX-NEXT: vpermilps {{.*#+}} xmm0 = xmm0[3,2,2,3]
; AVX-NEXT: retq		; AVX-NEXT: retq
%l = shufflevector <4 x float> %x, <4 x float> undef, <4 x i32> <i32 undef, i32 undef, i32 0, i32 2>		%l = shufflevector <4 x float> %x, <4 x float> undef, <4 x i32> <i32 undef, i32 undef, i32 0, i32 2>
%r = shufflevector <4 x float> %x, <4 x float> undef, <4 x i32> <i32 undef, i32 undef, i32 1, i32 3>		%r = shufflevector <4 x float> %x, <4 x float> undef, <4 x i32> <i32 undef, i32 undef, i32 1, i32 3>
%add = fadd <4 x float> %l, %r		%add = fadd <4 x float> %l, %r
%shuffle2 = shufflevector <4 x float> %add, <4 x float> undef, <4 x i32> <i32 3, i32 2, i32 undef, i32 undef>		%shuffle2 = shufflevector <4 x float> %add, <4 x float> undef, <4 x i32> <i32 3, i32 2, i32 undef, i32 undef>
ret <4 x float> %shuffle2		ret <4 x float> %shuffle2
}		}

define <4 x float> @add_ps_007_2(<4 x float> %x) {		define <4 x float> @add_ps_007_2(<4 x float> %x) {
; SSE-LABEL: add_ps_007_2:		; SSE-LABEL: add_ps_007_2:
; SSE: # %bb.0:		; SSE: # %bb.0:
; SSE-NEXT: movddup {{.*#+}} xmm1 = xmm0[0,0]		; SSE-NEXT: haddps %xmm0, %xmm0
; SSE-NEXT: shufps {{.*#+}} xmm0 = xmm0[0,1,1,3]
; SSE-NEXT: addps %xmm1, %xmm0
; SSE-NEXT: retq		; SSE-NEXT: retq
;		;
; AVX-LABEL: add_ps_007_2:		; AVX-LABEL: add_ps_007_2:
; AVX: # %bb.0:		; AVX: # %bb.0:
; AVX-NEXT: vmovddup {{.*#+}} xmm1 = xmm0[0,0]		; AVX-NEXT: vhaddps %xmm0, %xmm0, %xmm0
; AVX-NEXT: vpermilps {{.*#+}} xmm0 = xmm0[0,1,1,3]
; AVX-NEXT: vaddps %xmm0, %xmm1, %xmm0
; AVX-NEXT: retq		; AVX-NEXT: retq
%l = shufflevector <4 x float> %x, <4 x float> undef, <4 x i32> <i32 undef, i32 undef, i32 0, i32 undef>		%l = shufflevector <4 x float> %x, <4 x float> undef, <4 x i32> <i32 undef, i32 undef, i32 0, i32 undef>
%r = shufflevector <4 x float> %x, <4 x float> undef, <4 x i32> <i32 undef, i32 undef, i32 1, i32 undef>		%r = shufflevector <4 x float> %x, <4 x float> undef, <4 x i32> <i32 undef, i32 undef, i32 1, i32 undef>
%add = fadd <4 x float> %l, %r		%add = fadd <4 x float> %l, %r
ret <4 x float> %add		ret <4 x float> %add
}		}

define <4 x float> @add_ps_008(<4 x float> %x) {		define <4 x float> @add_ps_008(<4 x float> %x) {
; SSE-LABEL: add_ps_008:		; SSE-LABEL: add_ps_008:
; SSE: # %bb.0:		; SSE: # %bb.0:
; SSE-NEXT: movsldup {{.*#+}} xmm1 = xmm0[0,0,2,2]		; SSE-NEXT: haddps %xmm0, %xmm0
; SSE-NEXT: addps %xmm1, %xmm0
; SSE-NEXT: retq		; SSE-NEXT: retq
;		;
; AVX-LABEL: add_ps_008:		; AVX-LABEL: add_ps_008:
; AVX: # %bb.0:		; AVX: # %bb.0:
; AVX-NEXT: vmovsldup {{.*#+}} xmm1 = xmm0[0,0,2,2]		; AVX-NEXT: vhaddps %xmm0, %xmm0, %xmm0
; AVX-NEXT: vaddps %xmm0, %xmm1, %xmm0
; AVX-NEXT: retq		; AVX-NEXT: retq
%l = shufflevector <4 x float> %x, <4 x float> undef, <4 x i32> <i32 undef, i32 undef, i32 undef, i32 2>		%l = shufflevector <4 x float> %x, <4 x float> undef, <4 x i32> <i32 undef, i32 undef, i32 undef, i32 2>
%add = fadd <4 x float> %l, %x		%add = fadd <4 x float> %l, %x
ret <4 x float> %add		ret <4 x float> %add
}		}

define <4 x float> @add_ps_017(<4 x float> %x) {		define <4 x float> @add_ps_017(<4 x float> %x) {
; SSE-LABEL: add_ps_017:		; SSE-LABEL: add_ps_017:
; SSE: # %bb.0:		; SSE: # %bb.0:
; SSE-NEXT: movsldup {{.*#+}} xmm1 = xmm0[0,0,2,2]		; SSE-NEXT: haddps %xmm0, %xmm0
; SSE-NEXT: addps %xmm0, %xmm1		; SSE-NEXT: shufps {{.*#+}} xmm0 = xmm0[3,1,2,3]
; SSE-NEXT: shufps {{.*#+}} xmm1 = xmm1[3,1,2,3]
; SSE-NEXT: movaps %xmm1, %xmm0
; SSE-NEXT: retq		; SSE-NEXT: retq
;		;
; AVX-LABEL: add_ps_017:		; AVX-LABEL: add_ps_017:
; AVX: # %bb.0:		; AVX: # %bb.0:
; AVX-NEXT: vmovsldup {{.*#+}} xmm1 = xmm0[0,0,2,2]		; AVX-NEXT: vhaddps %xmm0, %xmm0, %xmm0
; AVX-NEXT: vaddps %xmm0, %xmm1, %xmm0
; AVX-NEXT: vpermilps {{.*#+}} xmm0 = xmm0[3,1,2,3]		; AVX-NEXT: vpermilps {{.*#+}} xmm0 = xmm0[3,1,2,3]
; AVX-NEXT: retq		; AVX-NEXT: retq
%l = shufflevector <4 x float> %x, <4 x float> undef, <4 x i32> <i32 undef, i32 undef, i32 undef, i32 2>		%l = shufflevector <4 x float> %x, <4 x float> undef, <4 x i32> <i32 undef, i32 undef, i32 undef, i32 2>
%add = fadd <4 x float> %l, %x		%add = fadd <4 x float> %l, %x
%shuffle2 = shufflevector <4 x float> %add, <4 x float> undef, <4 x i32> <i32 3, i32 undef, i32 undef, i32 undef>		%shuffle2 = shufflevector <4 x float> %add, <4 x float> undef, <4 x i32> <i32 3, i32 undef, i32 undef, i32 undef>
ret <4 x float> %shuffle2		ret <4 x float> %shuffle2
}		}

define <4 x float> @add_ps_018(<4 x float> %x) {		define <4 x float> @add_ps_018(<4 x float> %x) {
; SSE-LABEL: add_ps_018:		; SSE-LABEL: add_ps_018:
; SSE: # %bb.0:		; SSE: # %bb.0:
; SSE-NEXT: movddup {{.*#+}} xmm1 = xmm0[0,0]		; SSE-NEXT: haddps %xmm0, %xmm0
; SSE-NEXT: shufps {{.*#+}} xmm0 = xmm0[0,1,1,3]
; SSE-NEXT: addps %xmm1, %xmm0
; SSE-NEXT: shufps {{.*#+}} xmm0 = xmm0[0,2,2,3]		; SSE-NEXT: shufps {{.*#+}} xmm0 = xmm0[0,2,2,3]
; SSE-NEXT: retq		; SSE-NEXT: retq
;		;
; AVX-LABEL: add_ps_018:		; AVX-LABEL: add_ps_018:
; AVX: # %bb.0:		; AVX: # %bb.0:
; AVX-NEXT: vmovddup {{.*#+}} xmm1 = xmm0[0,0]		; AVX-NEXT: vhaddps %xmm0, %xmm0, %xmm0
; AVX-NEXT: vpermilps {{.*#+}} xmm0 = xmm0[0,1,1,3]
; AVX-NEXT: vaddps %xmm0, %xmm1, %xmm0
; AVX-NEXT: vpermilps {{.*#+}} xmm0 = xmm0[0,2,2,3]		; AVX-NEXT: vpermilps {{.*#+}} xmm0 = xmm0[0,2,2,3]
; AVX-NEXT: retq		; AVX-NEXT: retq
%l = shufflevector <4 x float> %x, <4 x float> undef, <4 x i32> <i32 undef, i32 undef, i32 0, i32 undef>		%l = shufflevector <4 x float> %x, <4 x float> undef, <4 x i32> <i32 undef, i32 undef, i32 0, i32 undef>
%r = shufflevector <4 x float> %x, <4 x float> undef, <4 x i32> <i32 undef, i32 undef, i32 1, i32 undef>		%r = shufflevector <4 x float> %x, <4 x float> undef, <4 x i32> <i32 undef, i32 undef, i32 1, i32 undef>
%add = fadd <4 x float> %l, %r		%add = fadd <4 x float> %l, %r
%shuffle2 = shufflevector <4 x float> %add, <4 x float> undef, <4 x i32> <i32 undef, i32 2, i32 undef, i32 undef>		%shuffle2 = shufflevector <4 x float> %add, <4 x float> undef, <4 x i32> <i32 undef, i32 2, i32 undef, i32 undef>
ret <4 x float> %shuffle2		ret <4 x float> %shuffle2
}		}

llvm/trunk/test/CodeGen/X86/phaddsub.ll

; AVX-NEXT: retq
%b = shufflevector <4 x i32> %x, <4 x i32> %y, <4 x i32> <i32 0, i32 2, i32 4, i32 6>		%b = shufflevector <4 x i32> %x, <4 x i32> %y, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
%r = sub <4 x i32> %a, %b		%r = sub <4 x i32> %a, %b
ret <4 x i32> %r		ret <4 x i32> %r
}		}

define <4 x i32> @phaddd_single_source1(<4 x i32> %x) {		define <4 x i32> @phaddd_single_source1(<4 x i32> %x) {
; SSSE3-LABEL: phaddd_single_source1:		; SSSE3-LABEL: phaddd_single_source1:
; SSSE3: # %bb.0:		; SSSE3: # %bb.0:
; SSSE3-NEXT: pshufd {{.*#+}} xmm1 = xmm0[0,1,0,2]		; SSSE3-NEXT: phaddd %xmm0, %xmm0
; SSSE3-NEXT: pshufd {{.*#+}} xmm0 = xmm0[0,1,1,3]
; SSSE3-NEXT: paddd %xmm1, %xmm0
; SSSE3-NEXT: retq		; SSSE3-NEXT: retq
;		;
; AVX-LABEL: phaddd_single_source1:		; AVX-LABEL: phaddd_single_source1:
; AVX: # %bb.0:		; AVX: # %bb.0:
; AVX-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[0,1,0,2]		; AVX-NEXT: vphaddd %xmm0, %xmm0, %xmm0
; AVX-NEXT: vpshufd {{.*#+}} xmm0 = xmm0[0,1,1,3]
; AVX-NEXT: vpaddd %xmm0, %xmm1, %xmm0
; AVX-NEXT: retq		; AVX-NEXT: retq
%l = shufflevector <4 x i32> %x, <4 x i32> undef, <4 x i32> <i32 undef, i32 undef, i32 0, i32 2>		%l = shufflevector <4 x i32> %x, <4 x i32> undef, <4 x i32> <i32 undef, i32 undef, i32 0, i32 2>
%r = shufflevector <4 x i32> %x, <4 x i32> undef, <4 x i32> <i32 undef, i32 undef, i32 1, i32 3>		%r = shufflevector <4 x i32> %x, <4 x i32> undef, <4 x i32> <i32 undef, i32 undef, i32 1, i32 3>
%add = add <4 x i32> %l, %r		%add = add <4 x i32> %l, %r
ret <4 x i32> %add		ret <4 x i32> %add
}		}

define <4 x i32> @phaddd_single_source2(<4 x i32> %x) {		define <4 x i32> @phaddd_single_source2(<4 x i32> %x) {
; SSSE3-LABEL: phaddd_single_source2:		; SSSE3-LABEL: phaddd_single_source2:
; SSSE3: # %bb.0:		; SSSE3: # %bb.0:
; SSSE3-NEXT: pshufd {{.*#+}} xmm1 = xmm0[0,1,0,2]		; SSSE3-NEXT: phaddd %xmm0, %xmm0
; SSSE3-NEXT: pshufd {{.*#+}} xmm0 = xmm0[0,1,1,3]
; SSSE3-NEXT: paddd %xmm1, %xmm0
; SSSE3-NEXT: pshufd {{.*#+}} xmm0 = xmm0[3,2,2,3]		; SSSE3-NEXT: pshufd {{.*#+}} xmm0 = xmm0[3,2,2,3]
; SSSE3-NEXT: retq		; SSSE3-NEXT: retq
;		;
; AVX-LABEL: phaddd_single_source2:		; AVX-LABEL: phaddd_single_source2:
; AVX: # %bb.0:		; AVX: # %bb.0:
; AVX-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[0,1,0,2]		; AVX-NEXT: vphaddd %xmm0, %xmm0, %xmm0
; AVX-NEXT: vpshufd {{.*#+}} xmm0 = xmm0[0,1,1,3]
; AVX-NEXT: vpaddd %xmm0, %xmm1, %xmm0
; AVX-NEXT: vpshufd {{.*#+}} xmm0 = xmm0[3,2,2,3]		; AVX-NEXT: vpshufd {{.*#+}} xmm0 = xmm0[3,2,2,3]
; AVX-NEXT: retq		; AVX-NEXT: retq
%l = shufflevector <4 x i32> %x, <4 x i32> undef, <4 x i32> <i32 undef, i32 undef, i32 0, i32 2>		%l = shufflevector <4 x i32> %x, <4 x i32> undef, <4 x i32> <i32 undef, i32 undef, i32 0, i32 2>
%r = shufflevector <4 x i32> %x, <4 x i32> undef, <4 x i32> <i32 undef, i32 undef, i32 1, i32 3>		%r = shufflevector <4 x i32> %x, <4 x i32> undef, <4 x i32> <i32 undef, i32 undef, i32 1, i32 3>
%add = add <4 x i32> %l, %r		%add = add <4 x i32> %l, %r
%shuffle2 = shufflevector <4 x i32> %add, <4 x i32> undef, <4 x i32> <i32 3, i32 2, i32 undef, i32 undef>		%shuffle2 = shufflevector <4 x i32> %add, <4 x i32> undef, <4 x i32> <i32 3, i32 2, i32 undef, i32 undef>
ret <4 x i32> %shuffle2		ret <4 x i32> %shuffle2
}		}

define <4 x i32> @phaddd_single_source3(<4 x i32> %x) {		define <4 x i32> @phaddd_single_source3(<4 x i32> %x) {
; SSSE3-LABEL: phaddd_single_source3:		; SSSE3-LABEL: phaddd_single_source3:
; SSSE3: # %bb.0:		; SSSE3: # %bb.0:
; SSSE3-NEXT: pshufd {{.*#+}} xmm1 = xmm0[0,1,0,1]		; SSSE3-NEXT: phaddd %xmm0, %xmm0
; SSSE3-NEXT: pshufd {{.*#+}} xmm0 = xmm0[0,1,1,3]
; SSSE3-NEXT: paddd %xmm1, %xmm0
; SSSE3-NEXT: retq		; SSSE3-NEXT: retq
;		;
; AVX-LABEL: phaddd_single_source3:		; AVX-LABEL: phaddd_single_source3:
; AVX: # %bb.0:		; AVX: # %bb.0:
; AVX-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[0,1,0,1]		; AVX-NEXT: vphaddd %xmm0, %xmm0, %xmm0
; AVX-NEXT: vpmovzxdq {{.*#+}} xmm0 = xmm0[0],zero,xmm0[1],zero
; AVX-NEXT: vpaddd %xmm0, %xmm1, %xmm0
; AVX-NEXT: retq		; AVX-NEXT: retq
%l = shufflevector <4 x i32> %x, <4 x i32> undef, <4 x i32> <i32 undef, i32 undef, i32 0, i32 undef>		%l = shufflevector <4 x i32> %x, <4 x i32> undef, <4 x i32> <i32 undef, i32 undef, i32 0, i32 undef>
%r = shufflevector <4 x i32> %x, <4 x i32> undef, <4 x i32> <i32 undef, i32 undef, i32 1, i32 undef>		%r = shufflevector <4 x i32> %x, <4 x i32> undef, <4 x i32> <i32 undef, i32 undef, i32 1, i32 undef>
%add = add <4 x i32> %l, %r		%add = add <4 x i32> %l, %r
ret <4 x i32> %add		ret <4 x i32> %add
}		}

define <4 x i32> @phaddd_single_source4(<4 x i32> %x) {		define <4 x i32> @phaddd_single_source4(<4 x i32> %x) {
; SSSE3-LABEL: phaddd_single_source4:		; SSSE3-LABEL: phaddd_single_source4:
; SSSE3: # %bb.0:		; SSSE3: # %bb.0:
; SSSE3-NEXT: pshufd {{.*#+}} xmm1 = xmm0[0,1,2,2]		; SSSE3-NEXT: phaddd %xmm0, %xmm0
; SSSE3-NEXT: paddd %xmm1, %xmm0
; SSSE3-NEXT: retq		; SSSE3-NEXT: retq
;		;
; AVX-LABEL: phaddd_single_source4:		; AVX-LABEL: phaddd_single_source4:
; AVX: # %bb.0:		; AVX: # %bb.0:
; AVX-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[0,1,2,2]		; AVX-NEXT: vphaddd %xmm0, %xmm0, %xmm0
; AVX-NEXT: vpaddd %xmm0, %xmm1, %xmm0
; AVX-NEXT: retq		; AVX-NEXT: retq
%l = shufflevector <4 x i32> %x, <4 x i32> undef, <4 x i32> <i32 undef, i32 undef, i32 undef, i32 2>		%l = shufflevector <4 x i32> %x, <4 x i32> undef, <4 x i32> <i32 undef, i32 undef, i32 undef, i32 2>
%add = add <4 x i32> %l, %x		%add = add <4 x i32> %l, %x
ret <4 x i32> %add		ret <4 x i32> %add
}		}

define <4 x i32> @phaddd_single_source5(<4 x i32> %x) {		define <4 x i32> @phaddd_single_source5(<4 x i32> %x) {
; SSSE3-LABEL: phaddd_single_source5:		; SSSE3-LABEL: phaddd_single_source5:
; SSSE3: # %bb.0:		; SSSE3: # %bb.0:
; SSSE3-NEXT: pshufd {{.*#+}} xmm1 = xmm0[0,1,2,2]		; SSSE3-NEXT: phaddd %xmm0, %xmm0
; SSSE3-NEXT: paddd %xmm0, %xmm1		; SSSE3-NEXT: pshufd {{.*#+}} xmm0 = xmm0[3,1,2,3]
; SSSE3-NEXT: pshufd {{.*#+}} xmm0 = xmm1[3,1,2,3]
; SSSE3-NEXT: retq		; SSSE3-NEXT: retq
;		;
; AVX-LABEL: phaddd_single_source5:		; AVX-LABEL: phaddd_single_source5:
; AVX: # %bb.0:		; AVX: # %bb.0:
; AVX-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[0,1,2,2]		; AVX-NEXT: vphaddd %xmm0, %xmm0, %xmm0
; AVX-NEXT: vpaddd %xmm0, %xmm1, %xmm0
; AVX-NEXT: vpshufd {{.*#+}} xmm0 = xmm0[3,1,2,3]		; AVX-NEXT: vpshufd {{.*#+}} xmm0 = xmm0[3,1,2,3]
; AVX-NEXT: retq		; AVX-NEXT: retq
%l = shufflevector <4 x i32> %x, <4 x i32> undef, <4 x i32> <i32 undef, i32 undef, i32 undef, i32 2>		%l = shufflevector <4 x i32> %x, <4 x i32> undef, <4 x i32> <i32 undef, i32 undef, i32 undef, i32 2>
%add = add <4 x i32> %l, %x		%add = add <4 x i32> %l, %x
%shuffle2 = shufflevector <4 x i32> %add, <4 x i32> undef, <4 x i32> <i32 3, i32 undef, i32 undef, i32 undef>		%shuffle2 = shufflevector <4 x i32> %add, <4 x i32> undef, <4 x i32> <i32 3, i32 undef, i32 undef, i32 undef>
ret <4 x i32> %shuffle2		ret <4 x i32> %shuffle2
}		}

define <4 x i32> @phaddd_single_source6(<4 x i32> %x) {		define <4 x i32> @phaddd_single_source6(<4 x i32> %x) {
; SSSE3-LABEL: phaddd_single_source6:		; SSSE3-LABEL: phaddd_single_source6:
; SSSE3: # %bb.0:		; SSSE3: # %bb.0:
; SSSE3-NEXT: pshufd {{.*#+}} xmm1 = xmm0[0,1,0,1]		; SSSE3-NEXT: phaddd %xmm0, %xmm0
; SSSE3-NEXT: pshufd {{.*#+}} xmm0 = xmm0[0,1,1,3]
; SSSE3-NEXT: paddd %xmm1, %xmm0
; SSSE3-NEXT: pshufd {{.*#+}} xmm0 = xmm0[2,2,3,3]		; SSSE3-NEXT: pshufd {{.*#+}} xmm0 = xmm0[2,2,3,3]
; SSSE3-NEXT: retq		; SSSE3-NEXT: retq
;		;
; AVX-LABEL: phaddd_single_source6:		; AVX-LABEL: phaddd_single_source6:
; AVX: # %bb.0:		; AVX: # %bb.0:
; AVX-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[0,1,0,1]		; AVX-NEXT: vphaddd %xmm0, %xmm0, %xmm0
; AVX-NEXT: vpmovzxdq {{.*#+}} xmm0 = xmm0[0],zero,xmm0[1],zero
; AVX-NEXT: vpaddd %xmm0, %xmm1, %xmm0
; AVX-NEXT: vpshufd {{.*#+}} xmm0 = xmm0[2,2,3,3]		; AVX-NEXT: vpshufd {{.*#+}} xmm0 = xmm0[2,2,3,3]
; AVX-NEXT: retq		; AVX-NEXT: retq
%l = shufflevector <4 x i32> %x, <4 x i32> undef, <4 x i32> <i32 undef, i32 undef, i32 0, i32 undef>		%l = shufflevector <4 x i32> %x, <4 x i32> undef, <4 x i32> <i32 undef, i32 undef, i32 0, i32 undef>
%r = shufflevector <4 x i32> %x, <4 x i32> undef, <4 x i32> <i32 undef, i32 undef, i32 1, i32 undef>		%r = shufflevector <4 x i32> %x, <4 x i32> undef, <4 x i32> <i32 undef, i32 undef, i32 1, i32 undef>
%add = add <4 x i32> %l, %r		%add = add <4 x i32> %l, %r
%shuffle2 = shufflevector <4 x i32> %add, <4 x i32> undef, <4 x i32> <i32 undef, i32 2, i32 undef, i32 undef>		%shuffle2 = shufflevector <4 x i32> %add, <4 x i32> undef, <4 x i32> <i32 undef, i32 2, i32 undef, i32 undef>
ret <4 x i32> %shuffle2		ret <4 x i32> %shuffle2
}		}

define <8 x i16> @phaddw_single_source1(<8 x i16> %x) {		define <8 x i16> @phaddw_single_source1(<8 x i16> %x) {
; SSSE3-LABEL: phaddw_single_source1:		; SSSE3-LABEL: phaddw_single_source1:
; SSSE3: # %bb.0:		; SSSE3: # %bb.0:
; SSSE3-NEXT: movdqa %xmm0, %xmm1		; SSSE3-NEXT: phaddw %xmm0, %xmm0
; SSSE3-NEXT: pshufb {{.*#+}} xmm1 = xmm1[0,1,4,5,4,5,6,7,0,1,4,5,8,9,12,13]
; SSSE3-NEXT: pshufb {{.*#+}} xmm0 = xmm0[6,7,2,3,4,5,6,7,2,3,6,7,10,11,14,15]
; SSSE3-NEXT: paddw %xmm1, %xmm0
; SSSE3-NEXT: retq		; SSSE3-NEXT: retq
;		;
; AVX-LABEL: phaddw_single_source1:		; AVX-LABEL: phaddw_single_source1:
; AVX: # %bb.0:		; AVX: # %bb.0:
; AVX-NEXT: vpshufb {{.*#+}} xmm1 = xmm0[0,1,4,5,4,5,6,7,0,1,4,5,8,9,12,13]		; AVX-NEXT: vphaddw %xmm0, %xmm0, %xmm0
; AVX-NEXT: vpshufb {{.*#+}} xmm0 = xmm0[6,7,2,3,4,5,6,7,2,3,6,7,10,11,14,15]
; AVX-NEXT: vpaddw %xmm0, %xmm1, %xmm0
; AVX-NEXT: retq		; AVX-NEXT: retq
%l = shufflevector <8 x i16> %x, <8 x i16> undef, <8 x i32> <i32 undef, i32 undef, i32 undef, i32 undef, i32 0, i32 2, i32 4, i32 6>		%l = shufflevector <8 x i16> %x, <8 x i16> undef, <8 x i32> <i32 undef, i32 undef, i32 undef, i32 undef, i32 0, i32 2, i32 4, i32 6>
%r = shufflevector <8 x i16> %x, <8 x i16> undef, <8 x i32> <i32 undef, i32 undef, i32 undef, i32 undef, i32 1, i32 3, i32 5, i32 7>		%r = shufflevector <8 x i16> %x, <8 x i16> undef, <8 x i32> <i32 undef, i32 undef, i32 undef, i32 undef, i32 1, i32 3, i32 5, i32 7>
%add = add <8 x i16> %l, %r		%add = add <8 x i16> %l, %r
ret <8 x i16> %add		ret <8 x i16> %add
}		}

define <8 x i16> @phaddw_single_source2(<8 x i16> %x) {		define <8 x i16> @phaddw_single_source2(<8 x i16> %x) {
; SSSE3-LABEL: phaddw_single_source2:		; SSSE3-LABEL: phaddw_single_source2:
; SSSE3: # %bb.0:		; SSSE3: # %bb.0:
; SSSE3-NEXT: pshuflw {{.*#+}} xmm1 = xmm0[0,2,2,3,4,5,6,7]		; SSSE3-NEXT: phaddw %xmm0, %xmm0
; SSSE3-NEXT: pshufd {{.*#+}} xmm1 = xmm1[0,1,0,3]
; SSSE3-NEXT: pshuflw {{.*#+}} xmm0 = xmm0[1,3,2,3,4,5,6,7]
; SSSE3-NEXT: pshufd {{.*#+}} xmm0 = xmm0[0,1,0,3]
; SSSE3-NEXT: paddw %xmm1, %xmm0
; SSSE3-NEXT: pshufhw {{.*#+}} xmm0 = xmm0[0,1,2,3,5,4,6,7]		; SSSE3-NEXT: pshufhw {{.*#+}} xmm0 = xmm0[0,1,2,3,5,4,6,7]
; SSSE3-NEXT: pshufd {{.*#+}} xmm0 = xmm0[2,1,2,3]		; SSSE3-NEXT: pshufd {{.*#+}} xmm0 = xmm0[2,1,2,3]
; SSSE3-NEXT: retq		; SSSE3-NEXT: retq
;		;
; AVX-LABEL: phaddw_single_source2:		; AVX-LABEL: phaddw_single_source2:
; AVX: # %bb.0:		; AVX: # %bb.0:
; AVX-NEXT: vpshuflw {{.*#+}} xmm1 = xmm0[0,2,2,3,4,5,6,7]		; AVX-NEXT: vphaddw %xmm0, %xmm0, %xmm0
; AVX-NEXT: vpshufd {{.*#+}} xmm1 = xmm1[0,1,0,3]
; AVX-NEXT: vpshuflw {{.*#+}} xmm0 = xmm0[1,3,2,3,4,5,6,7]
; AVX-NEXT: vpshufd {{.*#+}} xmm0 = xmm0[0,1,0,3]
; AVX-NEXT: vpaddw %xmm0, %xmm1, %xmm0
; AVX-NEXT: vpshufhw {{.*#+}} xmm0 = xmm0[0,1,2,3,5,4,6,7]		; AVX-NEXT: vpshufhw {{.*#+}} xmm0 = xmm0[0,1,2,3,5,4,6,7]
; AVX-NEXT: vpshufd {{.*#+}} xmm0 = xmm0[2,1,2,3]		; AVX-NEXT: vpshufd {{.*#+}} xmm0 = xmm0[2,1,2,3]
; AVX-NEXT: retq		; AVX-NEXT: retq
%l = shufflevector <8 x i16> %x, <8 x i16> undef, <8 x i32> <i32 undef, i32 undef, i32 undef, i32 undef, i32 0, i32 2, i32 4, i32 6>		%l = shufflevector <8 x i16> %x, <8 x i16> undef, <8 x i32> <i32 undef, i32 undef, i32 undef, i32 undef, i32 0, i32 2, i32 4, i32 6>
%r = shufflevector <8 x i16> %x, <8 x i16> undef, <8 x i32> <i32 undef, i32 undef, i32 undef, i32 undef, i32 1, i32 3, i32 5, i32 7>		%r = shufflevector <8 x i16> %x, <8 x i16> undef, <8 x i32> <i32 undef, i32 undef, i32 undef, i32 undef, i32 1, i32 3, i32 5, i32 7>
%add = add <8 x i16> %l, %r		%add = add <8 x i16> %l, %r
%shuffle2 = shufflevector <8 x i16> %add, <8 x i16> undef, <8 x i32> <i32 5, i32 4, i32 3, i32 2, i32 undef, i32 undef, i32 undef, i32 undef>		%shuffle2 = shufflevector <8 x i16> %add, <8 x i16> undef, <8 x i32> <i32 5, i32 4, i32 3, i32 2, i32 undef, i32 undef, i32 undef, i32 undef>
ret <8 x i16> %shuffle2		ret <8 x i16> %shuffle2
}		}

define <8 x i16> @phaddw_single_source3(<8 x i16> %x) {		define <8 x i16> @phaddw_single_source3(<8 x i16> %x) {
; SSSE3-LABEL: phaddw_single_source3:		; SSSE3-LABEL: phaddw_single_source3:
; SSSE3: # %bb.0:		; SSSE3: # %bb.0:
; SSSE3-NEXT: pshuflw {{.*#+}} xmm1 = xmm0[0,2,2,3,4,5,6,7]		; SSSE3-NEXT: phaddw %xmm0, %xmm0
; SSSE3-NEXT: pshufd {{.*#+}} xmm1 = xmm1[0,1,0,3]
; SSSE3-NEXT: pshuflw {{.*#+}} xmm0 = xmm0[1,3,2,3,4,5,6,7]
; SSSE3-NEXT: pshufd {{.*#+}} xmm0 = xmm0[0,1,0,3]
; SSSE3-NEXT: paddw %xmm1, %xmm0
; SSSE3-NEXT: retq		; SSSE3-NEXT: retq
;		;
; AVX-LABEL: phaddw_single_source3:		; AVX-LABEL: phaddw_single_source3:
; AVX: # %bb.0:		; AVX: # %bb.0:
; AVX-NEXT: vpshuflw {{.*#+}} xmm1 = xmm0[0,2,2,3,4,5,6,7]		; AVX-NEXT: vphaddw %xmm0, %xmm0, %xmm0
; AVX-NEXT: vpshufd {{.*#+}} xmm1 = xmm1[0,1,0,3]
; AVX-NEXT: vpshuflw {{.*#+}} xmm0 = xmm0[1,3,2,3,4,5,6,7]
; AVX-NEXT: vpshufd {{.*#+}} xmm0 = xmm0[0,1,0,3]
; AVX-NEXT: vpaddw %xmm0, %xmm1, %xmm0
; AVX-NEXT: retq		; AVX-NEXT: retq
%l = shufflevector <8 x i16> %x, <8 x i16> undef, <8 x i32> <i32 undef, i32 undef, i32 undef, i32 undef, i32 0, i32 2, i32 undef, i32 undef>		%l = shufflevector <8 x i16> %x, <8 x i16> undef, <8 x i32> <i32 undef, i32 undef, i32 undef, i32 undef, i32 0, i32 2, i32 undef, i32 undef>
%r = shufflevector <8 x i16> %x, <8 x i16> undef, <8 x i32> <i32 undef, i32 undef, i32 undef, i32 undef, i32 1, i32 3, i32 undef, i32 undef>		%r = shufflevector <8 x i16> %x, <8 x i16> undef, <8 x i32> <i32 undef, i32 undef, i32 undef, i32 undef, i32 1, i32 3, i32 undef, i32 undef>
%add = add <8 x i16> %l, %r		%add = add <8 x i16> %l, %r
ret <8 x i16> %add		ret <8 x i16> %add
}		}

define <8 x i16> @phaddw_single_source4(<8 x i16> %x) {		define <8 x i16> @phaddw_single_source4(<8 x i16> %x) {
; SSSE3-LABEL: phaddw_single_source4:		; SSSE3-LABEL: phaddw_single_source4:
; SSSE3: # %bb.0:		; SSSE3: # %bb.0:
; SSSE3-NEXT: movdqa %xmm0, %xmm1		; SSSE3-NEXT: phaddw %xmm0, %xmm0
; SSSE3-NEXT: pslld $16, %xmm1
; SSSE3-NEXT: paddw %xmm0, %xmm1
; SSSE3-NEXT: movdqa %xmm1, %xmm0
; SSSE3-NEXT: retq		; SSSE3-NEXT: retq
;		;
; AVX-LABEL: phaddw_single_source4:		; AVX-LABEL: phaddw_single_source4:
; AVX: # %bb.0:		; AVX: # %bb.0:
; AVX-NEXT: vpslld $16, %xmm0, %xmm1		; AVX-NEXT: vphaddw %xmm0, %xmm0, %xmm0
; AVX-NEXT: vpaddw %xmm0, %xmm1, %xmm0
; AVX-NEXT: retq		; AVX-NEXT: retq
%l = shufflevector <8 x i16> %x, <8 x i16> undef, <8 x i32> <i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 6>		%l = shufflevector <8 x i16> %x, <8 x i16> undef, <8 x i32> <i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 6>
%add = add <8 x i16> %l, %x		%add = add <8 x i16> %l, %x
ret <8 x i16> %add		ret <8 x i16> %add
}		}

define <8 x i16> @phaddw_single_source6(<8 x i16> %x) {		define <8 x i16> @phaddw_single_source6(<8 x i16> %x) {
; SSSE3-LABEL: phaddw_single_source6:		; SSSE3-LABEL: phaddw_single_source6:
; SSSE3: # %bb.0:		; SSSE3: # %bb.0:
; SSSE3-NEXT: pshufd {{.*#+}} xmm1 = xmm0[0,1,0,1]		; SSSE3-NEXT: phaddw %xmm0, %xmm0
; SSSE3-NEXT: pshufd {{.*#+}} xmm0 = xmm0[0,1,0,3]
; SSSE3-NEXT: pshufhw {{.*#+}} xmm0 = xmm0[0,1,2,3,5,5,6,7]
; SSSE3-NEXT: paddw %xmm1, %xmm0
; SSSE3-NEXT: psrldq {{.*#+}} xmm0 = xmm0[6,7,8,9,10,11,12,13,14,15],zero,zero,zero,zero,zero,zero		; SSSE3-NEXT: psrldq {{.*#+}} xmm0 = xmm0[6,7,8,9,10,11,12,13,14,15],zero,zero,zero,zero,zero,zero
; SSSE3-NEXT: retq		; SSSE3-NEXT: retq
;		;
; AVX-LABEL: phaddw_single_source6:		; AVX-LABEL: phaddw_single_source6:
; AVX: # %bb.0:		; AVX: # %bb.0:
; AVX-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[0,1,0,1]		; AVX-NEXT: vphaddw %xmm0, %xmm0, %xmm0
; AVX-NEXT: vpmovzxwq {{.*#+}} xmm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero
; AVX-NEXT: vpaddw %xmm0, %xmm1, %xmm0
; AVX-NEXT: vpsrldq {{.*#+}} xmm0 = xmm0[6,7,8,9,10,11,12,13,14,15],zero,zero,zero,zero,zero,zero		; AVX-NEXT: vpsrldq {{.*#+}} xmm0 = xmm0[6,7,8,9,10,11,12,13,14,15],zero,zero,zero,zero,zero,zero
; AVX-NEXT: retq		; AVX-NEXT: retq
%l = shufflevector <8 x i16> %x, <8 x i16> undef, <8 x i32> <i32 undef, i32 undef, i32 undef, i32 undef, i32 0, i32 undef, i32 undef, i32 undef>		%l = shufflevector <8 x i16> %x, <8 x i16> undef, <8 x i32> <i32 undef, i32 undef, i32 undef, i32 undef, i32 0, i32 undef, i32 undef, i32 undef>
%r = shufflevector <8 x i16> %x, <8 x i16> undef, <8 x i32> <i32 undef, i32 undef, i32 undef, i32 undef, i32 1, i32 undef, i32 undef, i32 undef>		%r = shufflevector <8 x i16> %x, <8 x i16> undef, <8 x i32> <i32 undef, i32 undef, i32 undef, i32 undef, i32 1, i32 undef, i32 undef, i32 undef>
%add = add <8 x i16> %l, %r		%add = add <8 x i16> %l, %r
%shuffle2 = shufflevector <8 x i16> %add, <8 x i16> undef, <8 x i32> <i32 undef, i32 4, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>		%shuffle2 = shufflevector <8 x i16> %add, <8 x i16> undef, <8 x i32> <i32 undef, i32 4, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
ret <8 x i16> %shuffle2		ret <8 x i16> %shuffle2
}		}

llvm/trunk/test/CodeGen/X86/vector-shuffle-combining.ll

	; AVX-NEXT: vpinsrd $0, %edi, %xmm0, %xmm0			; AVX-NEXT: vpinsrd $0, %edi, %xmm0, %xmm0
	; AVX-NEXT: retq			; AVX-NEXT: retq
	%a0 = insertelement <4 x i32> undef, i32 %f, i32 0			%a0 = insertelement <4 x i32> undef, i32 %f, i32 0
	%ret = shufflevector <4 x i32> %a0, <4 x i32> <i32 undef, i32 4, i32 5, i32 30>, <4 x i32> <i32 0, i32 5, i32 6, i32 7>			%ret = shufflevector <4 x i32> %a0, <4 x i32> <i32 undef, i32 4, i32 5, i32 30>, <4 x i32> <i32 0, i32 5, i32 6, i32 7>
	ret <4 x i32> %ret			ret <4 x i32> %ret
	}			}

	define <4 x float> @PR22377(<4 x float> %a, <4 x float> %b) {			define <4 x float> @PR22377(<4 x float> %a, <4 x float> %b) {
	; SSE-LABEL: PR22377:			; SSE2-LABEL: PR22377:
	; SSE: # %bb.0: # %entry			; SSE2: # %bb.0: # %entry
	; SSE-NEXT: movaps %xmm0, %xmm1			; SSE2-NEXT: movaps %xmm0, %xmm1
	; SSE-NEXT: shufps {{.*#+}} xmm1 = xmm1[1,3],xmm0[1,3]			; SSE2-NEXT: shufps {{.*#+}} xmm1 = xmm1[1,3],xmm0[1,3]
	; SSE-NEXT: shufps {{.*#+}} xmm0 = xmm0[0,2,0,2]			; SSE2-NEXT: shufps {{.*#+}} xmm0 = xmm0[0,2,0,2]
	; SSE-NEXT: addps %xmm0, %xmm1			; SSE2-NEXT: addps %xmm0, %xmm1
	; SSE-NEXT: unpcklps {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]			; SSE2-NEXT: unpcklps {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
	; SSE-NEXT: retq			; SSE2-NEXT: retq
				;
				; SSSE3-LABEL: PR22377:
				; SSSE3: # %bb.0: # %entry
				; SSSE3-NEXT: movaps %xmm0, %xmm1
				; SSSE3-NEXT: haddps %xmm0, %xmm1
				; SSSE3-NEXT: shufps {{.*#+}} xmm0 = xmm0[0,2],xmm1[0,1]
				; SSSE3-NEXT: shufps {{.*#+}} xmm0 = xmm0[0,2,1,3]
				; SSSE3-NEXT: retq
				;
				; SSE41-LABEL: PR22377:
				; SSE41: # %bb.0: # %entry
				; SSE41-NEXT: movaps %xmm0, %xmm1
				; SSE41-NEXT: haddps %xmm0, %xmm1
				; SSE41-NEXT: shufps {{.*#+}} xmm0 = xmm0[0,2],xmm1[0,1]
				; SSE41-NEXT: shufps {{.*#+}} xmm0 = xmm0[0,2,1,3]
				; SSE41-NEXT: retq
	;			;
	; AVX-LABEL: PR22377:			; AVX-LABEL: PR22377:
	; AVX: # %bb.0: # %entry			; AVX: # %bb.0: # %entry
	; AVX-NEXT: vpermilps {{.*#+}} xmm1 = xmm0[1,3,1,3]			; AVX-NEXT: vhaddps %xmm0, %xmm0, %xmm1
	; AVX-NEXT: vpermilps {{.*#+}} xmm0 = xmm0[0,2,0,2]			; AVX-NEXT: vshufps {{.*#+}} xmm0 = xmm0[0,2],xmm1[0,1]
	; AVX-NEXT: vaddps %xmm0, %xmm1, %xmm1			; AVX-NEXT: vpermilps {{.*#+}} xmm0 = xmm0[0,2,1,3]
	; AVX-NEXT: vunpcklps {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
	; AVX-NEXT: retq			; AVX-NEXT: retq
	entry:			entry:
	%s1 = shufflevector <4 x float> %a, <4 x float> undef, <4 x i32> <i32 1, i32 3, i32 1, i32 3>			%s1 = shufflevector <4 x float> %a, <4 x float> undef, <4 x i32> <i32 1, i32 3, i32 1, i32 3>
	%s2 = shufflevector <4 x float> %a, <4 x float> undef, <4 x i32> <i32 0, i32 2, i32 0, i32 2>			%s2 = shufflevector <4 x float> %a, <4 x float> undef, <4 x i32> <i32 0, i32 2, i32 0, i32 2>
	%r2 = fadd <4 x float> %s1, %s2			%r2 = fadd <4 x float> %s1, %s2
	%s3 = shufflevector <4 x float> %s2, <4 x float> %r2, <4 x i32> <i32 0, i32 4, i32 1, i32 5>			%s3 = shufflevector <4 x float> %s2, <4 x float> %r2, <4 x i32> <i32 0, i32 4, i32 1, i32 5>
	ret <4 x float> %s3			ret <4 x float> %s3
	}			}