This is an archive of the discontinued LLVM Phabricator instance.

[X86][SSE] Add v16i8/v32i8 multiplication support
Closed, Public

Authored by RKSimon on Apr 20 2015, 11:11 AM.

Details

Reviewers

spatel
qcolombet
delena
andreadb

Commits

rG4f683c264a6d: [X86][SSE] Add v16i8/v32i8 multiplication support
rL235837: [X86][SSE] Add v16i8/v32i8 multiplication support

Summary

Patch to allow int8 vectors to be multiplied on the SSE unit instead of being scalarized.

The patch sign extends the i8 lanes to i16, uses the SSE2 pmullw multiplication instruction, then packs the lower byte from each result.

Once vpackuswb zmm support is present this should also work for v64i8 multiplication on AVX512BW targets.
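As a rough illustration of the sequence described in the summary, the following is a scalar C++ model, not code from the patch. It assumes that pmullw keeps the low 16 bits of each product and that the pack step keeps the low byte of each 16-bit lane. The names mulViaWidening and mulReference are made up for this sketch.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using V16i8 = std::array<std::int8_t, 16>;

// Scalar model of the lowering: sign extend i8 -> i16, multiply as 16-bit
// lanes, then keep the low byte of each product.
V16i8 mulViaWidening(const V16i8 &a, const V16i8 &b) {
  V16i8 r{};
  for (std::size_t i = 0; i < r.size(); ++i) {
    // Sign extension to 16 bits.
    const std::int16_t wa = a[i];
    const std::int16_t wb = b[i];
    // Low 16 bits of the product (the assumed pmullw behaviour).
    const std::uint16_t prod = static_cast<std::uint16_t>(
        static_cast<std::uint32_t>(static_cast<std::int32_t>(wa) * wb));
    // Keep only the low byte (the mask-and-pack step).
    r[i] = static_cast<std::int8_t>(static_cast<std::uint8_t>(prod & 0xFF));
  }
  return r;
}

// Reference: wrapping 8-bit multiply computed directly on the bytes.
V16i8 mulReference(const V16i8 &a, const V16i8 &b) {
  V16i8 r{};
  for (std::size_t i = 0; i < r.size(); ++i) {
    const unsigned p = static_cast<unsigned>(static_cast<std::uint8_t>(a[i])) *
                       static_cast<std::uint8_t>(b[i]);
    r[i] = static_cast<std::int8_t>(static_cast<std::uint8_t>(p));
  }
  return r;
}
```

Both functions should agree on every input, because the low 8 bits of a product depend only on the low 8 bits of the operands.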

Diff Detail

Repository: rL LLVM

Event Timeline

RKSimon updated this revision to Diff 24030. Apr 20 2015, 11:11 AM

RKSimon retitled this revision from to [X86][SSE] Add v16i8/v32i8 multiplication support.

RKSimon updated this object.

RKSimon edited the test plan for this revision.

RKSimon added reviewers: qcolombet, andreadb, spatel, delena.

RKSimon set the repository for this revision to rL LLVM.

RKSimon added a subscriber: Unknown Object (MLST).

I think this sequence is good for SSE2, but for SSE4, AVX2 and definitely AVX-512 we can find a better one. See my inline comments.

lib/Target/X86/X86ISelLowering.cpp
15899	There is a more optimal way to sign extend on SSE4 and AVX2, at least for the lower part: just VPMOVSXBW. And for AVX-512 (SKX) we have a truncate from W to B. So I suggest writing more generic code and then lowering it according to the target:
	ALo = sign extend lower part of A from "B" to "W" (ISD::SIGN_EXTEND)
	BLo = sign extend lower part of B from "B" to "W"
	multiply ALo * BLo
	shift the whole vector A right to put the high part in place of the low (VPALIGNR)
	do the same with AHi, BHi
	use ISD::TRUNCATE for writing the result back
You can optimize the truncate/extend according to the target capabilities.
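The generic scheme suggested in this comment can be sketched in scalar C++ roughly as follows. This is only a model under assumptions: signExtendHalf stands in for ISD::SIGN_EXTEND on one half (with the high half already shifted down), truncateAndConcat stands in for ISD::TRUNCATE, and all function names are hypothetical rather than taken from the patch.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using V16i8 = std::array<std::int8_t, 16>;
using V8i16 = std::array<std::int16_t, 8>;

// Stand-in for ISD::SIGN_EXTEND on eight bytes starting at `first`.
// first == 0 is the low half; first == 8 models shifting the high half
// down (the VPALIGNR step) before extending.
V8i16 signExtendHalf(const V16i8 &v, std::size_t first) {
  V8i16 out{};
  for (std::size_t i = 0; i < out.size(); ++i)
    out[i] = v[first + i];
  return out;
}

// Lane-wise 16-bit multiply that keeps the low 16 bits of each product.
V8i16 mulLow16(const V8i16 &a, const V8i16 &b) {
  V8i16 out{};
  for (std::size_t i = 0; i < out.size(); ++i) {
    const std::uint32_t p = static_cast<std::uint32_t>(
        static_cast<std::int32_t>(a[i]) * b[i]);
    out[i] = static_cast<std::int16_t>(static_cast<std::uint16_t>(p));
  }
  return out;
}

// Stand-in for ISD::TRUNCATE: keep the low byte of every lane and
// concatenate the low and high halves.
V16i8 truncateAndConcat(const V8i16 &lo, const V8i16 &hi) {
  V16i8 out{};
  for (std::size_t i = 0; i < lo.size(); ++i) {
    out[i] = static_cast<std::int8_t>(
        static_cast<std::uint8_t>(static_cast<std::uint16_t>(lo[i]) & 0xFF));
    out[i + 8] = static_cast<std::int8_t>(
        static_cast<std::uint8_t>(static_cast<std::uint16_t>(hi[i]) & 0xFF));
  }
  return out;
}

// Extend both halves, multiply each pair, truncate the results back.
V16i8 mulGeneric(const V16i8 &a, const V16i8 &b) {
  const V8i16 rlo = mulLow16(signExtendHalf(a, 0), signExtendHalf(b, 0));
  const V8i16 rhi = mulLow16(signExtendHalf(a, 8), signExtendHalf(b, 8));
  return truncateAndConcat(rlo, rhi);
}
```

Keeping extend and truncate as separate steps mirrors the idea in the comment: each step can then be mapped to whatever the target offers.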

Thanks Elena for the review. I've updated the patch with your suggestions for SSE2/SSE4.1/AVX2 specific optimizations. If AVX512BW support for vpmovsxbw (zmm) and vpmovwb (xmm,ymm,zmm) (TRUNCATE) were added I could include support for v64i8 as well.

Reviewing this updated patch, it is quite bulky. I'm considering postponing it and first improving SSE2/SSE41 support for SIGN_EXTEND and SIGN_EXTEND_VECTOR_INREG, which would allow all of the target-specific code here to be removed.

The code is correct. I just suggest splitting it into 2-3 functions,
such as LowerMUL_v16i8_sse41() and LowerMUL_v16i8_sse2(), or you can choose better names.
But that is only a suggestion; you can commit the code as is if you want.

Closed by commit rL235837: [X86][SSE] Add v16i8/v32i8 multiplication support (authored by RKSimon). Apr 27 2015, 12:59 AM

This revision was automatically updated to reflect the committed changes.

Thanks Elena. I've submitted the patch as is. I have an upcoming patch for sign extension that will simplify much of this code and should make any additional effort to split off SSE2/SSE41 implementations redundant.

Revision Contents

Path	Size
lib/Target/X86/X86ISelLowering.cpp (revision 235314)	35 lines
test/CodeGen/X86/avx2-arith.ll (revision 235314)	20 lines
test/CodeGen/X86/pmul.ll (revision 235314)	71 lines

Diff 24030

lib/Target/X86/X86ISelLowering.cpp

if (!TM.Options.UseSoftFloat && Subtarget->hasSSE2()) {
addRegisterClass(MVT::v8i16, &X86::VR128RegClass);		addRegisterClass(MVT::v8i16, &X86::VR128RegClass);
addRegisterClass(MVT::v4i32, &X86::VR128RegClass);		addRegisterClass(MVT::v4i32, &X86::VR128RegClass);
addRegisterClass(MVT::v2i64, &X86::VR128RegClass);		addRegisterClass(MVT::v2i64, &X86::VR128RegClass);

setOperationAction(ISD::ADD, MVT::v16i8, Legal);		setOperationAction(ISD::ADD, MVT::v16i8, Legal);
setOperationAction(ISD::ADD, MVT::v8i16, Legal);		setOperationAction(ISD::ADD, MVT::v8i16, Legal);
setOperationAction(ISD::ADD, MVT::v4i32, Legal);		setOperationAction(ISD::ADD, MVT::v4i32, Legal);
setOperationAction(ISD::ADD, MVT::v2i64, Legal);		setOperationAction(ISD::ADD, MVT::v2i64, Legal);
		setOperationAction(ISD::MUL, MVT::v16i8, Custom);
setOperationAction(ISD::MUL, MVT::v4i32, Custom);		setOperationAction(ISD::MUL, MVT::v4i32, Custom);
setOperationAction(ISD::MUL, MVT::v2i64, Custom);		setOperationAction(ISD::MUL, MVT::v2i64, Custom);
setOperationAction(ISD::UMUL_LOHI, MVT::v4i32, Custom);		setOperationAction(ISD::UMUL_LOHI, MVT::v4i32, Custom);
setOperationAction(ISD::SMUL_LOHI, MVT::v4i32, Custom);		setOperationAction(ISD::SMUL_LOHI, MVT::v4i32, Custom);
setOperationAction(ISD::MULHU, MVT::v8i16, Legal);		setOperationAction(ISD::MULHU, MVT::v8i16, Legal);
setOperationAction(ISD::MULHS, MVT::v8i16, Legal);		setOperationAction(ISD::MULHS, MVT::v8i16, Legal);
setOperationAction(ISD::SUB, MVT::v16i8, Legal);		setOperationAction(ISD::SUB, MVT::v16i8, Legal);
setOperationAction(ISD::SUB, MVT::v8i16, Legal);		setOperationAction(ISD::SUB, MVT::v8i16, Legal);
if (Subtarget->hasInt256()) {
setOperationAction(ISD::SUB, MVT::v4i64, Legal);		setOperationAction(ISD::SUB, MVT::v4i64, Legal);
setOperationAction(ISD::SUB, MVT::v8i32, Legal);		setOperationAction(ISD::SUB, MVT::v8i32, Legal);
setOperationAction(ISD::SUB, MVT::v16i16, Legal);		setOperationAction(ISD::SUB, MVT::v16i16, Legal);
setOperationAction(ISD::SUB, MVT::v32i8, Legal);		setOperationAction(ISD::SUB, MVT::v32i8, Legal);

setOperationAction(ISD::MUL, MVT::v4i64, Custom);		setOperationAction(ISD::MUL, MVT::v4i64, Custom);
setOperationAction(ISD::MUL, MVT::v8i32, Legal);		setOperationAction(ISD::MUL, MVT::v8i32, Legal);
setOperationAction(ISD::MUL, MVT::v16i16, Legal);		setOperationAction(ISD::MUL, MVT::v16i16, Legal);
// Don't lower v32i8 because there is no 128-bit byte mul		setOperationAction(ISD::MUL, MVT::v32i8, Custom);

setOperationAction(ISD::UMUL_LOHI, MVT::v8i32, Custom);		setOperationAction(ISD::UMUL_LOHI, MVT::v8i32, Custom);
setOperationAction(ISD::SMUL_LOHI, MVT::v8i32, Custom);		setOperationAction(ISD::SMUL_LOHI, MVT::v8i32, Custom);
setOperationAction(ISD::MULHU, MVT::v16i16, Legal);		setOperationAction(ISD::MULHU, MVT::v16i16, Legal);
setOperationAction(ISD::MULHS, MVT::v16i16, Legal);		setOperationAction(ISD::MULHS, MVT::v16i16, Legal);

// The custom lowering for UINT_TO_FP for v8i32 becomes interesting		// The custom lowering for UINT_TO_FP for v8i32 becomes interesting
// when we have a 256bit-wide blend with immediate.		// when we have a 256bit-wide blend with immediate.
if (Subtarget->hasInt256()) {
setOperationAction(ISD::SUB, MVT::v4i64, Custom);		setOperationAction(ISD::SUB, MVT::v4i64, Custom);
setOperationAction(ISD::SUB, MVT::v8i32, Custom);		setOperationAction(ISD::SUB, MVT::v8i32, Custom);
setOperationAction(ISD::SUB, MVT::v16i16, Custom);		setOperationAction(ISD::SUB, MVT::v16i16, Custom);
setOperationAction(ISD::SUB, MVT::v32i8, Custom);		setOperationAction(ISD::SUB, MVT::v32i8, Custom);

setOperationAction(ISD::MUL, MVT::v4i64, Custom);		setOperationAction(ISD::MUL, MVT::v4i64, Custom);
setOperationAction(ISD::MUL, MVT::v8i32, Custom);		setOperationAction(ISD::MUL, MVT::v8i32, Custom);
setOperationAction(ISD::MUL, MVT::v16i16, Custom);		setOperationAction(ISD::MUL, MVT::v16i16, Custom);
// Don't lower v32i8 because there is no 128-bit byte mul		setOperationAction(ISD::MUL, MVT::v32i8, Custom);
}		}

// In the customized shift lowering, the legal cases in AVX2 will be		// In the customized shift lowering, the legal cases in AVX2 will be
// recognized.		// recognized.
setOperationAction(ISD::SRL, MVT::v4i64, Custom);		setOperationAction(ISD::SRL, MVT::v4i64, Custom);
setOperationAction(ISD::SRL, MVT::v8i32, Custom);		setOperationAction(ISD::SRL, MVT::v8i32, Custom);

setOperationAction(ISD::SHL, MVT::v4i64, Custom);		setOperationAction(ISD::SHL, MVT::v4i64, Custom);
static SDValue lower256BitVectorShuffle(SDValue Op, SDValue V1, SDValue V2,
ArrayRef<int> Mask = SVOp->getMask();		ArrayRef<int> Mask = SVOp->getMask();

// If we have a single input to the zero element, insert that into V1 if we		// If we have a single input to the zero element, insert that into V1 if we
// can do so cheaply.		// can do so cheaply.
int NumElts = VT.getVectorNumElements();		int NumElts = VT.getVectorNumElements();
int NumV2Elements = std::count_if(Mask.begin(), Mask.end(), [NumElts](int M) {		int NumV2Elements = std::count_if(Mask.begin(), Mask.end(), [NumElts](int M) {
return M >= NumElts;		return M >= NumElts;
});		});

if (NumV2Elements == 1 && Mask[0] >= NumElts)		if (NumV2Elements == 1 && Mask[0] >= NumElts)
if (SDValue Insertion = lowerVectorShuffleAsElementInsertion(		if (SDValue Insertion = lowerVectorShuffleAsElementInsertion(
DL, VT, V1, V2, Mask, Subtarget, DAG))		DL, VT, V1, V2, Mask, Subtarget, DAG))
return Insertion;		return Insertion;

// There is a really nice hard cut-over between AVX1 and AVX2 that means we can		// There is a really nice hard cut-over between AVX1 and AVX2 that means we can
// check for those subtargets here and avoid much of the subtarget querying in		// check for those subtargets here and avoid much of the subtarget querying in
// the per-vector-type lowering routines. With AVX1 we have essentially zero		// the per-vector-type lowering routines. With AVX1 we have essentially zero
if (VT.is256BitVector() && IdxVal == 0) {
// doing anyway after extracting to a 128-bit vector.		// doing anyway after extracting to a 128-bit vector.
if ((Subtarget->hasAVX() && (EltVT == MVT::f64 || EltVT == MVT::f32)) ||		if ((Subtarget->hasAVX() && (EltVT == MVT::f64 || EltVT == MVT::f32)) ||
(Subtarget->hasAVX2() && EltVT == MVT::i32)) {		(Subtarget->hasAVX2() && EltVT == MVT::i32)) {
SDValue N1Vec = DAG.getNode(ISD::SCALAR_TO_VECTOR, dl, VT, N1);		SDValue N1Vec = DAG.getNode(ISD::SCALAR_TO_VECTOR, dl, VT, N1);
N2 = DAG.getIntPtrConstant(1);		N2 = DAG.getIntPtrConstant(1);
return DAG.getNode(X86ISD::BLENDI, dl, VT, N0, N1Vec, N2);		return DAG.getNode(X86ISD::BLENDI, dl, VT, N0, N1Vec, N2);
}		}
}		}

// Get the desired 128-bit vector chunk.		// Get the desired 128-bit vector chunk.
SDValue V = Extract128BitVector(N0, IdxVal, DAG, dl);		SDValue V = Extract128BitVector(N0, IdxVal, DAG, dl);

// Insert the element into the desired chunk.		// Insert the element into the desired chunk.
unsigned NumEltsIn128 = 128 / EltVT.getSizeInBits();		unsigned NumEltsIn128 = 128 / EltVT.getSizeInBits();
unsigned IdxIn128 = IdxVal - (IdxVal / NumEltsIn128) * NumEltsIn128;		unsigned IdxIn128 = IdxVal - (IdxVal / NumEltsIn128) * NumEltsIn128;

V = DAG.getNode(ISD::INSERT_VECTOR_ELT, dl, V.getValueType(), V, N1,		V = DAG.getNode(ISD::INSERT_VECTOR_ELT, dl, V.getValueType(), V, N1,
static SDValue LowerMUL(SDValue Op, const X86Subtarget *Subtarget,

// Decompose 256-bit ops into smaller 128-bit ops.		// Decompose 256-bit ops into smaller 128-bit ops.
if (VT.is256BitVector() && !Subtarget->hasInt256())		if (VT.is256BitVector() && !Subtarget->hasInt256())
return Lower256IntArith(Op, DAG);		return Lower256IntArith(Op, DAG);

SDValue A = Op.getOperand(0);		SDValue A = Op.getOperand(0);
SDValue B = Op.getOperand(1);		SDValue B = Op.getOperand(1);

		// Lower v16i8/v32i8 mul as promotion to v8i16/v16i16 vector
		// pairs, multiply and truncate.
		if (VT == MVT::v16i8 || VT == MVT::v32i8) {
		MVT ExVT = (VT == MVT::v16i8 ? MVT::v8i16 : MVT::v16i16);
		// Extract the lo parts, sign extend to i16 and multiply
		SDValue ALo = DAG.getNode(X86ISD::UNPCKL, dl, VT, A, A);
		SDValue BLo = DAG.getNode(X86ISD::UNPCKL, dl, VT, B, B);
		ALo = DAG.getNode(ISD::BITCAST, dl, ExVT, ALo);
		BLo = DAG.getNode(ISD::BITCAST, dl, ExVT, BLo);
		ALo = DAG.getNode(ISD::SRA, dl, ExVT, ALo, DAG.getConstant(8, ExVT));
		BLo = DAG.getNode(ISD::SRA, dl, ExVT, BLo, DAG.getConstant(8, ExVT));
		SDValue RLo = DAG.getNode(ISD::MUL, dl, ExVT, ALo, BLo);
		// Extract the hi parts, sign extend to i16 and multiply
		SDValue AHi = DAG.getNode(X86ISD::UNPCKH, dl, VT, A, A);
		SDValue BHi = DAG.getNode(X86ISD::UNPCKH, dl, VT, B, B);
		AHi = DAG.getNode(ISD::BITCAST, dl, ExVT, AHi);
		BHi = DAG.getNode(ISD::BITCAST, dl, ExVT, BHi);
		AHi = DAG.getNode(ISD::SRA, dl, ExVT, AHi, DAG.getConstant(8, ExVT));
		BHi = DAG.getNode(ISD::SRA, dl, ExVT, BHi, DAG.getConstant(8, ExVT));
		SDValue RHi = DAG.getNode(ISD::MUL, dl, ExVT, AHi, BHi);
		// Mask the lower 8bits of the lo/hi results and pack
		RLo = DAG.getNode(ISD::AND, dl, ExVT, RLo, DAG.getConstant(255, ExVT));
		RHi = DAG.getNode(ISD::AND, dl, ExVT, RHi, DAG.getConstant(255, ExVT));
		return DAG.getNode(X86ISD::PACKUS, dl, VT, RLo, RHi);
		}

// Lower v4i32 mul as 2x shuffle, 2x pmuludq, 2x shuffle.		// Lower v4i32 mul as 2x shuffle, 2x pmuludq, 2x shuffle.
if (VT == MVT::v4i32) {		if (VT == MVT::v4i32) {
assert(Subtarget->hasSSE2() && !Subtarget->hasSSE41() &&		assert(Subtarget->hasSSE2() && !Subtarget->hasSSE41() &&
"Should not custom lower when pmuldq is available!");		"Should not custom lower when pmuldq is available!");

// Extract the odd parts.		// Extract the odd parts.
static const int UnpackMask[] = { 1, -1, 3, -1 };		static const int UnpackMask[] = { 1, -1, 3, -1 };
SDValue Aodds = DAG.getVectorShuffle(VT, dl, A, A, UnpackMask);		SDValue Aodds = DAG.getVectorShuffle(VT, dl, A, A, UnpackMask);
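Two details of the v16i8/v32i8 hunk above may be easier to follow with a per-lane model. The sketch below is an illustration only, not code from the patch, and all function names are hypothetical. It assumes that UNPCKL/UNPCKH of a vector with itself places each byte in both halves of a 16-bit lane, so that the SRA by 8 yields the sign-extended byte, and that PACKUS saturates signed 16-bit values to the range 0..255, which is why the products are masked with 255 first.

```cpp
#include <cstdint>

// One 16-bit lane of UNPCKL(A, A) after the bitcast: byte x sits in both
// the low and the high half of the lane.
std::uint16_t unpackWithSelf(std::int8_t x) {
  const std::uint16_t u = static_cast<std::uint8_t>(x);
  return static_cast<std::uint16_t>((u << 8) | u);
}

// Arithmetic shift right by 8 on one lane, written without shifting a
// negative value so the model stays portable.
std::int16_t sra8(std::uint16_t lane) {
  const int hi = lane >> 8;
  return static_cast<std::int16_t>(hi >= 128 ? hi - 256 : hi);
}

// One lane of PACKUS: signed 16-bit to unsigned 8-bit with saturation.
std::uint8_t packusLane(std::int16_t v) {
  if (v < 0)
    return 0;
  if (v > 255)
    return 255;
  return static_cast<std::uint8_t>(v);
}

// Per-lane pipeline: extend, multiply, mask to the low byte, pack.
std::uint8_t mulLane(std::int8_t a, std::int8_t b) {
  const int wa = sra8(unpackWithSelf(a));
  const int wb = sra8(unpackWithSelf(b));
  const std::uint16_t prod =
      static_cast<std::uint16_t>(static_cast<unsigned>(wa * wb));
  // Masking keeps the value in 0..255, so the saturating pack is lossless.
  return packusLane(static_cast<std::int16_t>(prod & 0xFF));
}
```

Without the mask, a product such as 117 * 3 = 351 would saturate to 255 in the pack step instead of wrapping to 95.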

test/CodeGen/X86/avx2-arith.ll

	}			}

	; CHECK: vpmullw %ymm			; CHECK: vpmullw %ymm
	define <16 x i16> @test_vpmullw(<16 x i16> %i, <16 x i16> %j) nounwind readnone {			define <16 x i16> @test_vpmullw(<16 x i16> %i, <16 x i16> %j) nounwind readnone {
	%x = mul <16 x i16> %i, %j			%x = mul <16 x i16> %i, %j
	ret <16 x i16> %x			ret <16 x i16> %x
	}			}

				; CHECK: mul-v32i8
				; CHECK: vpunpckhbw {{.*#+}} ymm2 = ymm1[8,8,9,9,10,10,11,11,12,12,13,13,14,14,15,15,24,24,25,25,26,26,27,27,28,28,29,29,30,30,31,31]
				; CHECK-NEXT: vpsraw $8, %ymm2, %ymm2
				; CHECK-NEXT: vpunpckhbw {{.*#+}} ymm3 = ymm0[8,8,9,9,10,10,11,11,12,12,13,13,14,14,15,15,24,24,25,25,26,26,27,27,28,28,29,29,30,30,31,31]
				; CHECK-NEXT: vpsraw $8, %ymm3, %ymm3
				; CHECK-NEXT: vpmullw %ymm2, %ymm3, %ymm2
				; CHECK-NEXT: vmovdqa {{.*#+}} ymm3 = [255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255]
				; CHECK-NEXT: vpand %ymm3, %ymm2, %ymm2
				; CHECK-NEXT: vpunpcklbw {{.*#+}} ymm1 = ymm1[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7,16,16,17,17,18,18,19,19,20,20,21,21,22,22,23,23]
				; CHECK-NEXT: vpsraw $8, %ymm1, %ymm1
				; CHECK-NEXT: vpunpcklbw {{.*#+}} ymm0 = ymm0[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7,16,16,17,17,18,18,19,19,20,20,21,21,22,22,23,23]
				; CHECK-NEXT: vpsraw $8, %ymm0, %ymm0
				; CHECK-NEXT: vpmullw %ymm1, %ymm0, %ymm0
				; CHECK-NEXT: vpand %ymm3, %ymm0, %ymm0
				; CHECK-NEXT: vpackuswb %ymm2, %ymm0, %ymm0
				define <32 x i8> @mul-v32i8(<32 x i8> %i, <32 x i8> %j) nounwind readnone {
				%x = mul <32 x i8> %i, %j
				ret <32 x i8> %x
				}
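The shuffle masks in the mul-v32i8 CHECK lines show that the 256-bit unpacks work on each 128-bit half separately. A small index model, given here as a sketch with hypothetical function names, suggests why the result bytes still end up in their original order. It assumes that vpackuswb on ymm registers also works per 128-bit half, with the first source filling bytes 0-7 and the second filling bytes 8-15 of each half.

```cpp
#include <array>

// For each 16-bit lane of the unpacked vector, the source byte it was
// built from. `high` selects vpunpckhbw (true) or vpunpcklbw (false).
std::array<int, 16> unpackSources(bool high) {
  std::array<int, 16> src{};
  for (int half = 0; half < 2; ++half) // the two 128-bit halves
    for (int j = 0; j < 8; ++j)
      src[half * 8 + j] = half * 16 + (high ? 8 : 0) + j;
  return src;
}

// For each byte of the packed result, the source byte it came from,
// under the per-half packing assumption stated above.
std::array<int, 32> roundTripSources() {
  const std::array<int, 16> lo = unpackSources(false);
  const std::array<int, 16> hi = unpackSources(true);
  std::array<int, 32> out{};
  for (int half = 0; half < 2; ++half)
    for (int j = 0; j < 8; ++j) {
      out[half * 16 + j] = lo[half * 8 + j];
      out[half * 16 + 8 + j] = hi[half * 8 + j];
    }
  return out;
}
```

Under these assumptions roundTripSources() is the identity mapping, so no cross-lane shuffle is needed after the pack.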

	; CHECK: mul-v4i64			; CHECK: mul-v4i64
	; CHECK: vpmuludq %ymm			; CHECK: vpmuludq %ymm
	; CHECK-NEXT: vpsrlq $32, %ymm			; CHECK-NEXT: vpsrlq $32, %ymm
	; CHECK-NEXT: vpmuludq %ymm			; CHECK-NEXT: vpmuludq %ymm
	; CHECK-NEXT: vpsllq $32, %ymm			; CHECK-NEXT: vpsllq $32, %ymm
	; CHECK-NEXT: vpaddq %ymm			; CHECK-NEXT: vpaddq %ymm
	; CHECK-NEXT: vpsrlq $32, %ymm			; CHECK-NEXT: vpsrlq $32, %ymm
	; CHECK-NEXT: vpmuludq %ymm			; CHECK-NEXT: vpmuludq %ymm

test/CodeGen/X86/pmul.ll

	; RUN: llc < %s -mtriple=x86_64-unknown-unknown | FileCheck %s --check-prefix=ALL --check-prefix=SSE2			; RUN: llc < %s -mtriple=x86_64-unknown-unknown | FileCheck %s --check-prefix=ALL --check-prefix=SSE2
	; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=sse4.1 | FileCheck %s --check-prefix=ALL --check-prefix=SSE41			; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=sse4.1 | FileCheck %s --check-prefix=ALL --check-prefix=SSE41

				define <16 x i8> @mul8c(<16 x i8> %i) nounwind {
				; ALL-LABEL: mul8c:
				; ALL: # BB#0: # %entry
				; ALL-NEXT: movdqa {{.*#+}} xmm1 = [117,117,117,117,117,117,117,117,117,117,117,117,117,117,117,117]
				; ALL-NEXT: movdqa %xmm1, %xmm2
				; ALL-NEXT: punpckhbw {{.*#+}} xmm2 = xmm2[8,8,9,9,10,10,11,11,12,12,13,13,14,14,15,15]
				; ALL-NEXT: psraw $8, %xmm2
				; ALL-NEXT: movdqa %xmm0, %xmm3
				; ALL-NEXT: punpckhbw {{.*#+}} xmm3 = xmm3[8,8,9,9,10,10,11,11,12,12,13,13,14,14,15,15]
				; ALL-NEXT: psraw $8, %xmm3
				; ALL-NEXT: pmullw %xmm2, %xmm3
				; ALL-NEXT: movdqa {{.*#+}} xmm2 = [255,255,255,255,255,255,255,255]
				; ALL-NEXT: pand %xmm2, %xmm3
				; ALL-NEXT: punpcklbw {{.*#+}} xmm1 = xmm1[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
				; ALL-NEXT: psraw $8, %xmm1
				; ALL-NEXT: punpcklbw {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
				; ALL-NEXT: psraw $8, %xmm0
				; ALL-NEXT: pmullw %xmm1, %xmm0
				; ALL-NEXT: pand %xmm2, %xmm0
				; ALL-NEXT: packuswb %xmm3, %xmm0
				; ALL-NEXT: retq
				entry:
				%A = mul <16 x i8> %i, < i8 117, i8 117, i8 117, i8 117, i8 117, i8 117, i8 117, i8 117, i8 117, i8 117, i8 117, i8 117, i8 117, i8 117, i8 117, i8 117 >
				ret <16 x i8> %A
				}

				define <8 x i16> @mul16c(<8 x i16> %i) nounwind {
				; ALL-LABEL: mul16c:
				; ALL: # BB#0: # %entry
				; ALL-NEXT: pmullw {{.*}}(%rip), %xmm0
				; ALL-NEXT: retq
				entry:
				%A = mul <8 x i16> %i, < i16 117, i16 117, i16 117, i16 117, i16 117, i16 117, i16 117, i16 117 >
				ret <8 x i16> %A
				}

	define <4 x i32> @a(<4 x i32> %i) nounwind {			define <4 x i32> @a(<4 x i32> %i) nounwind {
	; SSE2-LABEL: a:			; SSE2-LABEL: a:
	; SSE2: # BB#0: # %entry			; SSE2: # BB#0: # %entry
	; SSE2-NEXT: movdqa {{.*#+}} xmm1 = [117,117,117,117]			; SSE2-NEXT: movdqa {{.*#+}} xmm1 = [117,117,117,117]
	; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm0[1,1,3,3]			; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm0[1,1,3,3]
	; SSE2-NEXT: pmuludq %xmm1, %xmm0			; SSE2-NEXT: pmuludq %xmm1, %xmm0
	; SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]			; SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
	; SSE2-NEXT: pmuludq %xmm1, %xmm2			; SSE2-NEXT: pmuludq %xmm1, %xmm2
	Show All 25 Lines
	; ALL-NEXT: psllq $32, %xmm0			; ALL-NEXT: psllq $32, %xmm0
	; ALL-NEXT: paddq %xmm2, %xmm0			; ALL-NEXT: paddq %xmm2, %xmm0
	; ALL-NEXT: retq			; ALL-NEXT: retq
	entry:			entry:
	%A = mul <2 x i64> %i, < i64 117, i64 117 >			%A = mul <2 x i64> %i, < i64 117, i64 117 >
	ret <2 x i64> %A			ret <2 x i64> %A
	}			}

				define <16 x i8> @mul8(<16 x i8> %i, <16 x i8> %j) nounwind {
				; ALL-LABEL: mul8:
				; ALL: # BB#0: # %entry
				; ALL-NEXT: movdqa %xmm1, %xmm2
				; ALL-NEXT: punpckhbw {{.*#+}} xmm2 = xmm2[8,8,9,9,10,10,11,11,12,12,13,13,14,14,15,15]
				; ALL-NEXT: psraw $8, %xmm2
				; ALL-NEXT: movdqa %xmm0, %xmm3
				; ALL-NEXT: punpckhbw {{.*#+}} xmm3 = xmm3[8,8,9,9,10,10,11,11,12,12,13,13,14,14,15,15]
				; ALL-NEXT: psraw $8, %xmm3
				; ALL-NEXT: pmullw %xmm2, %xmm3
				; ALL-NEXT: movdqa {{.*#+}} xmm2 = [255,255,255,255,255,255,255,255]
				; ALL-NEXT: pand %xmm2, %xmm3
				; ALL-NEXT: punpcklbw {{.*#+}} xmm1 = xmm1[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
				; ALL-NEXT: psraw $8, %xmm1
				; ALL-NEXT: punpcklbw {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
				; ALL-NEXT: psraw $8, %xmm0
				; ALL-NEXT: pmullw %xmm1, %xmm0
				; ALL-NEXT: pand %xmm2, %xmm0
				; ALL-NEXT: packuswb %xmm3, %xmm0
				; ALL-NEXT: retq
				entry:
				%A = mul <16 x i8> %i, %j
				ret <16 x i8> %A
				}

				define <8 x i16> @mul16(<8 x i16> %i, <8 x i16> %j) nounwind {
				; ALL-LABEL: mul16:
				; ALL: # BB#0: # %entry
				; ALL-NEXT: pmullw %xmm1, %xmm0
				; ALL-NEXT: retq
				entry:
				%A = mul <8 x i16> %i, %j
				ret <8 x i16> %A
				}

	define <4 x i32> @c(<4 x i32> %i, <4 x i32> %j) nounwind {			define <4 x i32> @c(<4 x i32> %i, <4 x i32> %j) nounwind {
	; SSE2-LABEL: c:			; SSE2-LABEL: c:
	; SSE2: # BB#0: # %entry			; SSE2: # BB#0: # %entry
	; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm0[1,1,3,3]			; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm0[1,1,3,3]
	; SSE2-NEXT: pmuludq %xmm1, %xmm0			; SSE2-NEXT: pmuludq %xmm1, %xmm0
	; SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]			; SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
	; SSE2-NEXT: pshufd {{.*#+}} xmm1 = xmm1[1,1,3,3]			; SSE2-NEXT: pshufd {{.*#+}} xmm1 = xmm1[1,1,3,3]
	; SSE2-NEXT: pmuludq %xmm2, %xmm1			; SSE2-NEXT: pmuludq %xmm2, %xmm1