This is an archive of the discontinued LLVM Phabricator instance.

Differential D41484

[X86][SSE] Use PMADDWD for v4i32 multiplies with 17 or more leading zeros
Closed · Public

Authored by RKSimon on Dec 21 2017, 4:26 AM.


Details

Reviewers

craig.topper
pcordes
zvi
spatel

Commits

rG62411e4d4f70: [X86][SSE] Use PMADDWD for v4i32 multiplies with 17 or more leading zeros
rL321516: [X86][SSE] Use PMADDWD for v4i32 multiplies with 17 or more leading zeros

Summary

If the v4i32 elements have 17 or more leading zeros, then we can use PMADDWD for the integer multiply when PMULLD is unavailable or slow.

All 17 bits need to be zero because PMADDWD performs a v8i16 signed multiply-extend followed by a pairwise add: the upper 16 bits must be zero so that the pair being added in is zero, and the 17th bit must be zero so that the lower half is not incorrectly sign-extended.
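
The argument above can be checked with a scalar model of a single 32-bit PMADDWD result lane. This is a minimal sketch in plain C++, not LLVM code; the helper names (`sext16`, `pmaddwdLane`, `upper17Zero`, `mulViaPmaddwd`) are hypothetical and do not appear in the patch.

```cpp
#include <cassert>
#include <cstdint>

// Sign-extend the low 16 bits of v, as PMADDWD does for each i16 input.
int32_t sext16(uint32_t v) {
  v &= 0xFFFFu;
  return (v & 0x8000u) ? static_cast<int32_t>(v) - 0x10000
                       : static_cast<int32_t>(v);
}

// Scalar model of one 32-bit result lane of PMADDWD: each 32-bit input is
// viewed as two signed 16-bit halves, matching halves are multiplied to
// 32-bit products, and the two products are added modulo 2^32.
uint32_t pmaddwdLane(uint32_t a, uint32_t b) {
  int64_t lo = static_cast<int64_t>(sext16(a)) * sext16(b);
  int64_t hi = static_cast<int64_t>(sext16(a >> 16)) * sext16(b >> 16);
  return static_cast<uint32_t>(lo + hi);
}

// True if the upper 17 bits of v are zero (the condition from the summary).
bool upper17Zero(uint32_t v) { return (v & 0xFFFF8000u) == 0; }

// 32-bit multiply expressed through the PMADDWD model. Only valid when both
// operands satisfy the 17-leading-zeros precondition.
uint32_t mulViaPmaddwd(uint32_t a, uint32_t b) {
  assert(upper17Zero(a) && upper17Zero(b));
  return pmaddwdLane(a, b);
}
```

With both operands below 0x8000 the model agrees with an ordinary 32-bit multiply. If bit 15 is set, as in `pmaddwdLane(0x8000, 2)`, the low half is read as -32768 and the result is 0xFFFF0000 rather than 0x10000. If bit 16 is set, as in `pmaddwdLane(0x10000, 1)`, the upper half never meets the lower half of the other operand and the result is 0.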

If people want, I can try to incorporate this more into the ShrinkMode enum returned by canReduceVMulWidth?

Diff Detail

Repository: rL LLVM

Event Timeline

RKSimon created this revision. Dec 21 2017, 4:26 AM

craig.topper added inline comments. Dec 21 2017, 2:00 PM

test/CodeGen/X86/shrink_vmul.ll
Line 1 (On Diff #127859): Why doesn't this test have any AVX command lines? I assume some of the unpcks in the modified test case would be a zero extend on newer feature sets?

Rebased after adding AVX tests to shrink_vmul.ll

LGTM

I didn't realize when I made that avx comment that shrink vmul only applies to pre-sse4.1

This revision is now accepted and ready to land. Dec 27 2017, 12:38 PM

Closed by commit rL321516: [X86][SSE] Use PMADDWD for v4i32 multiplies with 17 or more leading zeros (authored by RKSimon). Dec 28 2017, 2:06 AM

This revision was automatically updated to reflect the committed changes.

In D41484#964571, @craig.topper wrote:

LGTM

I didn't realize when I made that avx comment that shrink vmul only applies to pre-sse4.1

Thanks - I'm wondering whether we should try to use MADD for SSE41+ targets as well - realistically v2Xi16 multiplies are always going to be faster than vXi32 (1cy or more latency saving according to Agner). Similar to your avx512 vXi64 multiply patches I guess.

In D41484#964762, @RKSimon wrote:

I'm wondering whether we should try to use MADD for SSE41+ targets as well

Yes, absolutely. Look for alternatives to PMULLD whenever possible except with -march=sandybridge / ivybridge, or KNL.

PMADDWD has twice the throughput (and half the latency) of PMULLD on Haswell and Skylake. (Although Skylake does have vector-integer multiply on two ports, so PMULLD is 10c latency, 1c throughput). PMULLD is also half throughput on Core2 (4 uops) and Nehalem (2 uops).

On Jaguar, PMULLD is half-throughput like on Haswell. On Silvermont, it's 7 uops with 11c throughput (11x worse than PMADDWD).

On Ryzen, they're both single-uop, but PMADDWD has 3c instead of 4c latency, and 1c instead of 2c throughput. Same thing on Bulldozer-family: 4c vs. 5c latency, and 1c vs. 2c throughput.

PMULUDQ (widening multiply of the even elements) is usually as fast as PMADDWD, but 32-bit low-half PMULLD multiply is slow on everything except Intel Sandybridge / Ivybridge, and KNL. The throughput penalty is at least a factor of 2 on CPUs other than those.

RKSimon mentioned this in D42258: [X86][SSE] Aggressively use PMADDWD for v4i32 multiplies with 17 or more leading zeros. Jan 18 2018, 11:44 AM

RKSimon mentioned this in rL323367: [X86][SSE] Aggressively use PMADDWD for v4i32 multiplies with 17 or more…. Jan 24 2018, 11:24 AM

Revision Contents

Path                                            Size
llvm/trunk/lib/Target/X86/X86ISelLowering.cpp   14 lines
llvm/trunk/test/CodeGen/X86/shrink_vmul.ll      46 lines
llvm/trunk/test/CodeGen/X86/slow-pmulld.ll      16 lines

Diff 128276

llvm/trunk/lib/Target/X86/X86ISelLowering.cpp


[... 22,080 lines not shown ...] if (VT == MVT::v16i8 || VT == MVT::v32i8 || VT == MVT::v64i8) {
     return DAG.getNode(X86ISD::PACKUS, dl, VT, RLo, RHi);
   }

   // Lower v4i32 mul as 2x shuffle, 2x pmuludq, 2x shuffle.
   if (VT == MVT::v4i32) {
     assert(Subtarget.hasSSE2() && !Subtarget.hasSSE41() &&
            "Should not custom lower when pmulld is available!");

+    // If the upper 17 bits of each element are zero then we can use PMADD.
+    APInt Mask17 = APInt::getHighBitsSet(32, 17);
+    if (DAG.MaskedValueIsZero(A, Mask17) && DAG.MaskedValueIsZero(B, Mask17))
+      return DAG.getNode(X86ISD::VPMADDWD, dl, VT,
+                         DAG.getBitcast(MVT::v8i16, A),
+                         DAG.getBitcast(MVT::v8i16, B));
+
     // Extract the odd parts.
     static const int UnpackMask[] = { 1, -1, 3, -1 };
     SDValue Aodds = DAG.getVectorShuffle(VT, dl, A, A, UnpackMask);
     SDValue Bodds = DAG.getVectorShuffle(VT, dl, B, B, UnpackMask);

     // Multiply the even parts.
     SDValue Evens = DAG.getNode(X86ISD::PMULUDQ, dl, MVT::v2i64, A, B);
     // Now multiply odd parts.
[... 10,157 lines not shown ...] static SDValue reduceVMULWidth(SDNode *N, SelectionDAG &DAG,
   SDLoc DL(N);
   SDValue N0 = N->getOperand(0);
   SDValue N1 = N->getOperand(1);
   EVT VT = N->getOperand(0).getValueType();
   unsigned NumElts = VT.getVectorNumElements();
   if ((NumElts % 2) != 0)
     return SDValue();

+  // If the upper 17 bits of each element are zero then we can use PMADD.
+  APInt Mask17 = APInt::getHighBitsSet(32, 17);
+  if (VT == MVT::v4i32 && DAG.MaskedValueIsZero(N0, Mask17) &&
+      DAG.MaskedValueIsZero(N1, Mask17))
+    return DAG.getNode(X86ISD::VPMADDWD, DL, VT, DAG.getBitcast(MVT::v8i16, N0),
+                       DAG.getBitcast(MVT::v8i16, N1));
+
   unsigned RegSize = 128;
   MVT OpsVT = MVT::getVectorVT(MVT::i16, RegSize / 16);
   EVT ReducedVT = EVT::getVectorVT(*DAG.getContext(), MVT::i16, NumElts);

   // Shrink the operands of mul.
   SDValue NewN0 = DAG.getNode(ISD::TRUNCATE, DL, ReducedVT, N0);
   SDValue NewN1 = DAG.getNode(ISD::TRUNCATE, DL, ReducedVT, N1);

[... 6,234 lines not shown ...]
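
As a rough illustration of the control flow that the first hunk adds to the v4i32 path, the sketch below models the guard and both multiply strategies on concrete lane values. It is plain C++ and only an approximation under stated assumptions: `upper17ZeroInAllLanes` stands in for `DAG.MaskedValueIsZero`, which reasons about known bits at compile time rather than runtime values, and `pmaddwd`, `mulViaPmuludq` and `lowerMulV4i32` are hypothetical names for scalar models, not functions from the patch.

```cpp
#include <array>
#include <cstdint>

using V4 = std::array<uint32_t, 4>;

// Same bit pattern as APInt::getHighBitsSet(32, 17): the top 17 bits set.
constexpr uint32_t kMask17 = 0xFFFF8000u;

// Assumption: concrete lane values replace the known-bits query of the patch.
bool upper17ZeroInAllLanes(const V4 &v) {
  for (uint32_t lane : v)
    if (lane & kMask17)
      return false;
  return true;
}

// Sign-extend the low 16 bits of v.
int32_t sext16(uint32_t v) {
  v &= 0xFFFFu;
  return (v & 0x8000u) ? static_cast<int32_t>(v) - 0x10000
                       : static_cast<int32_t>(v);
}

// Model of PMADDWD applied to the v8i16 view of two v4i32 values.
V4 pmaddwd(const V4 &a, const V4 &b) {
  V4 r{};
  for (int i = 0; i < 4; ++i) {
    int64_t lo = static_cast<int64_t>(sext16(a[i])) * sext16(b[i]);
    int64_t hi = static_cast<int64_t>(sext16(a[i] >> 16)) * sext16(b[i] >> 16);
    r[i] = static_cast<uint32_t>(lo + hi);
  }
  return r;
}

// Model of the older path described as "2x shuffle, 2x pmuludq, 2x shuffle".
V4 mulViaPmuludq(const V4 &a, const V4 &b) {
  // PMULUDQ widens lanes 0 and 2 into two 64-bit products.
  uint64_t even0 = static_cast<uint64_t>(a[0]) * b[0];
  uint64_t even1 = static_cast<uint64_t>(a[2]) * b[2];
  // The {1,-1,3,-1} shuffle moves lanes 1 and 3 into the even positions
  // so that a second PMULUDQ can multiply them.
  uint64_t odd0 = static_cast<uint64_t>(a[1]) * b[1];
  uint64_t odd1 = static_cast<uint64_t>(a[3]) * b[3];
  // Assumed recombination: keep the low 32 bits of each product, in lane order.
  return {static_cast<uint32_t>(even0), static_cast<uint32_t>(odd0),
          static_cast<uint32_t>(even1), static_cast<uint32_t>(odd1)};
}

// Guarded selection in the spirit of the added code: take the single
// PMADDWD when both operands have 17 leading zeros, else fall back.
V4 lowerMulV4i32(const V4 &a, const V4 &b) {
  if (upper17ZeroInAllLanes(a) && upper17ZeroInAllLanes(b))
    return pmaddwd(a, b);
  return mulViaPmuludq(a, b);
}
```

For inputs such as the constants 8199 and 18778 that appear in the updated tests, both strategies give the same lanes, which is why the guard can pick the cheaper one. When a lane has bit 15 or above set, the guard fails and the model falls back to the PMULUDQ sequence.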

llvm/trunk/test/CodeGen/X86/shrink_vmul.ll

[... 106 lines not shown ...]
 ; X86-SSE-NEXT: pushl %esi
 ; X86-SSE-NEXT: .cfi_def_cfa_offset 8
 ; X86-SSE-NEXT: .cfi_offset %esi, -8
 ; X86-SSE-NEXT: movl {{[0-9]+}}(%esp), %eax
 ; X86-SSE-NEXT: movl {{[0-9]+}}(%esp), %ecx
 ; X86-SSE-NEXT: movl {{[0-9]+}}(%esp), %edx
 ; X86-SSE-NEXT: movl c, %esi
 ; X86-SSE-NEXT: movd {{.*#+}} xmm0 = mem[0],zero,zero,zero
-; X86-SSE-NEXT: movd {{.*#+}} xmm1 = mem[0],zero,zero,zero
-; X86-SSE-NEXT: pxor %xmm2, %xmm2
-; X86-SSE-NEXT: punpcklbw {{.*#+}} xmm0 = xmm0[0],xmm2[0],xmm0[1],xmm2[1],xmm0[2],xmm2[2],xmm0[3],xmm2[3],xmm0[4],xmm2[4],xmm0[5],xmm2[5],xmm0[6],xmm2[6],xmm0[7],xmm2[7]
-; X86-SSE-NEXT: punpcklbw {{.*#+}} xmm1 = xmm1[0],xmm2[0],xmm1[1],xmm2[1],xmm1[2],xmm2[2],xmm1[3],xmm2[3],xmm1[4],xmm2[4],xmm1[5],xmm2[5],xmm1[6],xmm2[6],xmm1[7],xmm2[7]
-; X86-SSE-NEXT: pmullw %xmm0, %xmm1
-; X86-SSE-NEXT: punpcklwd {{.*#+}} xmm1 = xmm1[0],xmm2[0],xmm1[1],xmm2[1],xmm1[2],xmm2[2],xmm1[3],xmm2[3]
-; X86-SSE-NEXT: movdqu %xmm1, (%esi,%ecx,4)
+; X86-SSE-NEXT: pxor %xmm1, %xmm1
+; X86-SSE-NEXT: punpcklbw {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3],xmm0[4],xmm1[4],xmm0[5],xmm1[5],xmm0[6],xmm1[6],xmm0[7],xmm1[7]
+; X86-SSE-NEXT: punpcklwd {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3]
+; X86-SSE-NEXT: movd {{.*#+}} xmm2 = mem[0],zero,zero,zero
+; X86-SSE-NEXT: punpcklbw {{.*#+}} xmm2 = xmm2[0],xmm1[0],xmm2[1],xmm1[1],xmm2[2],xmm1[2],xmm2[3],xmm1[3],xmm2[4],xmm1[4],xmm2[5],xmm1[5],xmm2[6],xmm1[6],xmm2[7],xmm1[7]
+; X86-SSE-NEXT: punpcklwd {{.*#+}} xmm2 = xmm2[0],xmm1[0],xmm2[1],xmm1[1],xmm2[2],xmm1[2],xmm2[3],xmm1[3]
+; X86-SSE-NEXT: pmaddwd %xmm0, %xmm2
+; X86-SSE-NEXT: movdqu %xmm2, (%esi,%ecx,4)
 ; X86-SSE-NEXT: popl %esi
 ; X86-SSE-NEXT: retl
 ;
 ; X86-AVX-LABEL: mul_4xi8:
 ; X86-AVX: # %bb.0: # %entry
 ; X86-AVX-NEXT: pushl %esi
 ; X86-AVX-NEXT: .cfi_def_cfa_offset 8
 ; X86-AVX-NEXT: .cfi_offset %esi, -8
 ; X86-AVX-NEXT: movl {{[0-9]+}}(%esp), %eax
 ; X86-AVX-NEXT: movl {{[0-9]+}}(%esp), %ecx
 ; X86-AVX-NEXT: movl {{[0-9]+}}(%esp), %edx
 ; X86-AVX-NEXT: movl c, %esi
 ; X86-AVX-NEXT: vpmovzxbd {{.*#+}} xmm0 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
 ; X86-AVX-NEXT: vpmovzxbd {{.*#+}} xmm1 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
 ; X86-AVX-NEXT: vpmulld %xmm0, %xmm1, %xmm0
 ; X86-AVX-NEXT: vmovdqu %xmm0, (%esi,%ecx,4)
 ; X86-AVX-NEXT: popl %esi
 ; X86-AVX-NEXT: retl
 ;
 ; X64-SSE-LABEL: mul_4xi8:
 ; X64-SSE: # %bb.0: # %entry
 ; X64-SSE-NEXT: movq {{.*}}(%rip), %rax
 ; X64-SSE-NEXT: movd {{.*#+}} xmm0 = mem[0],zero,zero,zero
-; X64-SSE-NEXT: movd {{.*#+}} xmm1 = mem[0],zero,zero,zero
-; X64-SSE-NEXT: pxor %xmm2, %xmm2
-; X64-SSE-NEXT: punpcklbw {{.*#+}} xmm0 = xmm0[0],xmm2[0],xmm0[1],xmm2[1],xmm0[2],xmm2[2],xmm0[3],xmm2[3],xmm0[4],xmm2[4],xmm0[5],xmm2[5],xmm0[6],xmm2[6],xmm0[7],xmm2[7]
-; X64-SSE-NEXT: punpcklbw {{.*#+}} xmm1 = xmm1[0],xmm2[0],xmm1[1],xmm2[1],xmm1[2],xmm2[2],xmm1[3],xmm2[3],xmm1[4],xmm2[4],xmm1[5],xmm2[5],xmm1[6],xmm2[6],xmm1[7],xmm2[7]
-; X64-SSE-NEXT: pmullw %xmm0, %xmm1
-; X64-SSE-NEXT: punpcklwd {{.*#+}} xmm1 = xmm1[0],xmm2[0],xmm1[1],xmm2[1],xmm1[2],xmm2[2],xmm1[3],xmm2[3]
-; X64-SSE-NEXT: movdqu %xmm1, (%rax,%rdx,4)
+; X64-SSE-NEXT: pxor %xmm1, %xmm1
+; X64-SSE-NEXT: punpcklbw {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3],xmm0[4],xmm1[4],xmm0[5],xmm1[5],xmm0[6],xmm1[6],xmm0[7],xmm1[7]
+; X64-SSE-NEXT: punpcklwd {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3]
+; X64-SSE-NEXT: movd {{.*#+}} xmm2 = mem[0],zero,zero,zero
+; X64-SSE-NEXT: punpcklbw {{.*#+}} xmm2 = xmm2[0],xmm1[0],xmm2[1],xmm1[1],xmm2[2],xmm1[2],xmm2[3],xmm1[3],xmm2[4],xmm1[4],xmm2[5],xmm1[5],xmm2[6],xmm1[6],xmm2[7],xmm1[7]
+; X64-SSE-NEXT: punpcklwd {{.*#+}} xmm2 = xmm2[0],xmm1[0],xmm2[1],xmm1[1],xmm2[2],xmm1[2],xmm2[3],xmm1[3]
+; X64-SSE-NEXT: pmaddwd %xmm0, %xmm2
+; X64-SSE-NEXT: movdqu %xmm2, (%rax,%rdx,4)
 ; X64-SSE-NEXT: retq
 ;
 ; X64-AVX-LABEL: mul_4xi8:
 ; X64-AVX: # %bb.0: # %entry
 ; X64-AVX-NEXT: movq {{.*}}(%rip), %rax
 ; X64-AVX-NEXT: vpmovzxbd {{.*#+}} xmm0 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
 ; X64-AVX-NEXT: vpmovzxbd {{.*#+}} xmm1 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
 ; X64-AVX-NEXT: vpmulld %xmm0, %xmm1, %xmm0
[... 2,050 lines not shown ...]
 ; X86-SSE-NEXT: divl %ecx
 ; X86-SSE-NEXT: movd %edx, %xmm0
 ; X86-SSE-NEXT: punpckldq {{.*#+}} xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1]
 ; X86-SSE-NEXT: punpcklqdq {{.*#+}} xmm1 = xmm1[0],xmm2[0]
 ; X86-SSE-NEXT: xorl %eax, %eax
 ; X86-SSE-NEXT: xorl %edx, %edx
 ; X86-SSE-NEXT: divl (%eax)
 ; X86-SSE-NEXT: movd %edx, %xmm0
-; X86-SSE-NEXT: movdqa {{.*#+}} xmm2 = [8199,8199,8199,8199]
-; X86-SSE-NEXT: pshufd {{.*#+}} xmm3 = xmm1[1,1,3,3]
-; X86-SSE-NEXT: pmuludq %xmm2, %xmm1
-; X86-SSE-NEXT: pshufd {{.*#+}} xmm1 = xmm1[0,2,2,3]
-; X86-SSE-NEXT: pmuludq %xmm2, %xmm3
-; X86-SSE-NEXT: pshufd {{.*#+}} xmm2 = xmm3[0,2,2,3]
-; X86-SSE-NEXT: punpckldq {{.*#+}} xmm1 = xmm1[0],xmm2[0],xmm1[1],xmm2[1]
+; X86-SSE-NEXT: pmaddwd {{\.LCPI.*}}, %xmm1
 ; X86-SSE-NEXT: movl $8199, %eax # imm = 0x2007
 ; X86-SSE-NEXT: movd %eax, %xmm2
 ; X86-SSE-NEXT: pmuludq %xmm0, %xmm2
 ; X86-SSE-NEXT: movd %xmm2, (%eax)
 ; X86-SSE-NEXT: movdqa %xmm1, (%eax)
 ; X86-SSE-NEXT: retl
 ;
 ; X86-AVX1-LABEL: PR34947:
[... 177 lines not shown ...]
 ; X64-SSE-NEXT: divl %ecx
 ; X64-SSE-NEXT: movd %edx, %xmm0
 ; X64-SSE-NEXT: punpckldq {{.*#+}} xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1]
 ; X64-SSE-NEXT: punpcklqdq {{.*#+}} xmm1 = xmm1[0],xmm2[0]
 ; X64-SSE-NEXT: xorl %eax, %eax
 ; X64-SSE-NEXT: xorl %edx, %edx
 ; X64-SSE-NEXT: divl (%rax)
 ; X64-SSE-NEXT: movd %edx, %xmm0
-; X64-SSE-NEXT: movdqa {{.*#+}} xmm2 = [8199,8199,8199,8199]
-; X64-SSE-NEXT: pshufd {{.*#+}} xmm3 = xmm1[1,1,3,3]
-; X64-SSE-NEXT: pmuludq %xmm2, %xmm1
-; X64-SSE-NEXT: pshufd {{.*#+}} xmm1 = xmm1[0,2,2,3]
-; X64-SSE-NEXT: pmuludq %xmm2, %xmm3
-; X64-SSE-NEXT: pshufd {{.*#+}} xmm2 = xmm3[0,2,2,3]
-; X64-SSE-NEXT: punpckldq {{.*#+}} xmm1 = xmm1[0],xmm2[0],xmm1[1],xmm2[1]
+; X64-SSE-NEXT: pmaddwd {{.*}}(%rip), %xmm1
 ; X64-SSE-NEXT: movl $8199, %eax # imm = 0x2007
 ; X64-SSE-NEXT: movd %eax, %xmm2
 ; X64-SSE-NEXT: pmuludq %xmm0, %xmm2
 ; X64-SSE-NEXT: movd %xmm2, (%rax)
 ; X64-SSE-NEXT: movdqa %xmm1, (%rax)
 ; X64-SSE-NEXT: retq
 ;
 ; X64-AVX1-LABEL: PR34947:
[... 141 lines not shown ...]

llvm/trunk/test/CodeGen/X86/slow-pmulld.ll

 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
 ; RUN: llc < %s -mtriple=i386-unknown-unknown -mcpu=silvermont | FileCheck %s --check-prefix=CHECK32
 ; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=silvermont | FileCheck %s --check-prefix=CHECK64
 ; RUN: llc < %s -mtriple=i386-unknown-unknown -mattr=+sse4.1 | FileCheck %s --check-prefix=SSE4-32
 ; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+sse4.1 | FileCheck %s --check-prefix=SSE4-64

 ; Make sure that the slow-pmulld feature can be used without SSE4.1.
 ; RUN: llc < %s -mtriple=i386-unknown-unknown -mcpu=silvermont -mattr=-sse4.1

 define <4 x i32> @foo(<4 x i8> %A) {
 ; CHECK32-LABEL: foo:
 ; CHECK32: # %bb.0:
-; CHECK32-NEXT: pshufb {{.*#+}} xmm0 = xmm0[0],zero,xmm0[4],zero,xmm0[8],zero,xmm0[12],zero,xmm0[u,u,u,u,u,u,u,u]
-; CHECK32-NEXT: movdqa {{.*#+}} xmm1 = <18778,18778,18778,18778,u,u,u,u>
-; CHECK32-NEXT: movdqa %xmm0, %xmm2
-; CHECK32-NEXT: pmullw %xmm1, %xmm0
-; CHECK32-NEXT: pmulhw %xmm1, %xmm2
-; CHECK32-NEXT: punpcklwd {{.*#+}} xmm0 = xmm0[0],xmm2[0],xmm0[1],xmm2[1],xmm0[2],xmm2[2],xmm0[3],xmm2[3]
+; CHECK32-NEXT: pand {{\.LCPI.*}}, %xmm0
+; CHECK32-NEXT: pmaddwd {{\.LCPI.*}}, %xmm0
 ; CHECK32-NEXT: retl
 ;
 ; CHECK64-LABEL: foo:
 ; CHECK64: # %bb.0:
-; CHECK64-NEXT: pshufb {{.*#+}} xmm0 = xmm0[0],zero,xmm0[4],zero,xmm0[8],zero,xmm0[12],zero,xmm0[u,u,u,u,u,u,u,u]
-; CHECK64-NEXT: movdqa {{.*#+}} xmm1 = <18778,18778,18778,18778,u,u,u,u>
-; CHECK64-NEXT: movdqa %xmm0, %xmm2
-; CHECK64-NEXT: pmullw %xmm1, %xmm0
-; CHECK64-NEXT: pmulhw %xmm1, %xmm2
-; CHECK64-NEXT: punpcklwd {{.*#+}} xmm0 = xmm0[0],xmm2[0],xmm0[1],xmm2[1],xmm0[2],xmm2[2],xmm0[3],xmm2[3]
+; CHECK64-NEXT: pand {{.*}}(%rip), %xmm0
+; CHECK64-NEXT: pmaddwd {{.*}}(%rip), %xmm0
 ; CHECK64-NEXT: retq
 ;
 ; SSE4-32-LABEL: foo:
 ; SSE4-32: # %bb.0:
 ; SSE4-32-NEXT: pand {{\.LCPI.*}}, %xmm0
 ; SSE4-32-NEXT: pmulld {{\.LCPI.*}}, %xmm0
 ; SSE4-32-NEXT: retl
 ;
[... 38 lines not shown ...]