This is an archive of the discontinued LLVM Phabricator instance.

Differential D123163

[TLI] `TargetLowering::SimplifyDemandedVectorElts()`: narrowing bitcast: fill known zero elts from known src bits
Closed · Public

Authored by lebedev.ri on Apr 5 2022, 4:38 PM.


Details

Reviewers

RKSimon

Commits

rG34ce9fd864b5: [TLI] `TargetLowering::SimplifyDemandedVectorElts()`: narrowing bitcast: fill…

Summary

For example, in

%i0 = zext <2 x i8> %x to <2 x i16>
%i1 = bitcast <2 x i16> %i0 to <4 x i8>

(where %x is an arbitrary <2 x i8> value, on a little-endian target), the known-zero bits of %i0 are 0xFF00: the upper half of every element is known to be zero, but no element is known to be entirely zero. For %i1, no bits are known to be zero across all elements, but the elements under the 0b1010 mask (i.e. the odd elements) are known to be zero.

However, we did not perform this propagation.

I think I wrote that right?
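To make the example above concrete, here is a small standalone C++ sketch (not LLVM code; `zextThenBitcast` and `zeroLaneMask` are hypothetical helpers) that models the `zext` followed by the narrowing `bitcast` with an explicit little-endian byte layout and reports which narrow lanes are zero. Whatever the input bytes are, lanes 1 and 3 (mask 0b1010) come out zero, while lanes 0 and 2 depend on the input.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical model of:
//   %i0 = zext <2 x i8> %x to <2 x i16>
//   %i1 = bitcast <2 x i16> %i0 to <4 x i8>
// on a little-endian target: the low byte of each i16 becomes the even
// narrow lane and the high byte becomes the odd narrow lane.
std::array<uint8_t, 4> zextThenBitcast(const std::array<uint8_t, 2> &x) {
  std::array<uint8_t, 4> out{};
  for (int i = 0; i < 2; ++i) {
    uint16_t wide = x[i]; // zext: bits 0xFF00 are always zero
    out[2 * i] = static_cast<uint8_t>(wide & 0xFF);
    out[2 * i + 1] = static_cast<uint8_t>(wide >> 8);
  }
  return out;
}

// Bitmask of the narrow lanes that are zero (bit i set => lane i == 0).
unsigned zeroLaneMask(const std::array<uint8_t, 4> &v) {
  unsigned mask = 0;
  for (int i = 0; i < 4; ++i)
    if (v[i] == 0)
      mask |= 1u << i;
  return mask;
}
```

For input `{0xFF, 0x12}` the zero-lane mask is exactly `0b1010`; an input byte of zero only adds more zero lanes on top of the odd ones.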

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

lebedev.ri created this revision. Apr 5 2022, 4:38 PM

Herald added a project: Restricted Project. Apr 5 2022, 4:38 PM

Herald added subscribers: pengfei, hiraditya.

lebedev.ri requested review of this revision. Apr 5 2022, 4:38 PM

lebedev.ri edited the summary of this revision. Apr 5 2022, 4:47 PM

Harbormaster completed remote builds in B158084: Diff 420656.Apr 5 2022, 6:16 PM

RKSimon added inline comments. Apr 6 2022, 1:51 AM

llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
2771	Known is only guaranteed to be correct for the requested SrcDemandedBits/SrcDemandedElts - is that going to work in a different way to the DemandedElts check below? Sorry I haven't investigated properly yet, but we might have to check that we were demanding those src bits?

lebedev.ri added inline comments. Apr 6 2022, 2:04 AM

llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
2754–2759	`SrcDemandedBits` was just synthesized from `DemandedElts`, so I don't see any need to check the former?
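A rough model of the point under discussion, assuming `SrcDemandedElts` marks wide element `i / Scale` as demanded whenever narrow element `i` is (the actual scaling code lies outside the visible hunk). It uses `uint64_t` masks in place of `APInt`; `computeSrcDemand`, `lowMask` and `everyDemandedEltIsCovered` are hypothetical names. The check confirms that each demanded narrow element has both its wide element and its bit range demanded, so `Known` is guaranteed to describe it.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of the narrowing-bitcast demand computation, with
// uint64_t masks in place of APInt. Assumes Scale * EltSizeInBits <= 64 and
// NumElts <= 64. SrcDemandedElts is assumed to be "wide element i / Scale is
// demanded if narrow element i is".
struct SrcDemand {
  uint64_t Bits; // demanded bits within one wide element (SrcDemandedBits)
  uint64_t Elts; // demanded wide elements (SrcDemandedElts)
};

uint64_t lowMask(unsigned Width) {
  return Width >= 64 ? ~0ull : ((1ull << Width) - 1);
}

SrcDemand computeSrcDemand(uint64_t DemandedElts, unsigned NumElts,
                           unsigned Scale, unsigned EltSizeInBits) {
  SrcDemand D{0, 0};
  for (unsigned i = 0; i != NumElts; ++i) {
    if (!((DemandedElts >> i) & 1))
      continue;
    unsigned Ofs = (i % Scale) * EltSizeInBits; // same offset as the patch
    D.Bits |= lowMask(EltSizeInBits) << Ofs;
    D.Elts |= 1ull << (i / Scale);
  }
  return D;
}

// True if every demanded narrow element has its bit range in Bits and its
// wide element in Elts, i.e. Known is guaranteed to describe it.
bool everyDemandedEltIsCovered(uint64_t DemandedElts, unsigned NumElts,
                               unsigned Scale, unsigned EltSizeInBits) {
  SrcDemand D = computeSrcDemand(DemandedElts, NumElts, Scale, EltSizeInBits);
  uint64_t M = lowMask(EltSizeInBits);
  for (unsigned i = 0; i != NumElts; ++i) {
    if (!((DemandedElts >> i) & 1))
      continue;
    unsigned Ofs = (i % Scale) * EltSizeInBits;
    if (((D.Bits >> Ofs) & M) != M || !((D.Elts >> (i / Scale)) & 1))
      return false;
  }
  return true;
}
```

The converse does not hold: with `DemandedElts = 0b0110` (4 x i8 from 2 x i16), both wide elements and all 16 bits end up demanded, so `Known` also covers narrow elements 0 and 3 even though they were not requested. The `DemandedElts[Elt]` test in the new loop keeps those out of `KnownZero`.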

RKSimon added inline comments. Apr 6 2022, 3:29 AM

llvm/test/CodeGen/X86/slow-pmulld.ll
264	regression?
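The constant in question changed from `[18778,0,...]` to `<18778,u,...>`, i.e. its odd i16 lanes became undef. A scalar model of `PMADDWD` suggests why that is safe: each 32-bit result lane is `a[2i]*b[2i] + a[2i+1]*b[2i+1]`, and when `a` is a `zext` of `i8` to `i32`, every odd i16 lane of `a` is zero, so `b[2i+1]` cannot affect the result. The sketch below (hypothetical helper names, little-endian lane order assumed) only checks that arithmetic fact; it does not show which X86 combine actually drops those lanes.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Scalar model of 128-bit PMADDWD: multiply signed 16-bit lanes pairwise and
// add adjacent products into 32-bit lanes. Computed in int64_t and truncated,
// matching the hardware wrap in the single overflow case.
std::array<int32_t, 4> pmaddwd(const std::array<int16_t, 8> &a,
                               const std::array<int16_t, 8> &b) {
  std::array<int32_t, 4> r{};
  for (int i = 0; i < 4; ++i) {
    int64_t sum = int64_t(a[2 * i]) * b[2 * i] +
                  int64_t(a[2 * i + 1]) * b[2 * i + 1];
    r[i] = static_cast<int32_t>(static_cast<uint32_t>(sum));
  }
  return r;
}

// <4 x i32> zero-extended from <4 x i8>, viewed as <8 x i16> (little-endian):
// even lanes hold the byte, odd lanes are known zero.
std::array<int16_t, 8> zextBytesAsI16(const std::array<uint8_t, 4> &x) {
  std::array<int16_t, 8> v{};
  for (int i = 0; i < 4; ++i) {
    v[2 * i] = x[i];
    v[2 * i + 1] = 0;
  }
  return v;
}

// Multiplier <18778,u,18778,u,...>; `fill` stands in for the undef lanes.
std::array<int16_t, 8> multiplier(int16_t fill) {
  std::array<int16_t, 8> m{};
  for (int i = 0; i < 4; ++i) {
    m[2 * i] = 18778;
    m[2 * i + 1] = fill;
  }
  return m;
}
```

Any value of `fill` yields the same products, which are exactly `x[i] * 18778`.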

lebedev.ri added inline comments. Apr 6 2022, 3:29 AM

llvm/test/CodeGen/X86/slow-pmulld.ll
284	With AVX1, we can only broadcast i32 load to XMM/YMM, and i64 to YMM, but with AVX2 we can broadcast i8/i16/i32 load to XMM/YMM. Is `lowerBuildVectorAsBroadcast()` intentionally not doing that, because such i8/i16 broadcasts are slow, or is that a bug?

lebedev.ri added inline comments. Apr 6 2022, 3:32 AM

llvm/test/CodeGen/X86/slow-pmulld.ll
284	~~but with AVX2 we can broadcast i8/i16/i32 load to XMM/YMM.~~ but with AVX2 we can broadcast i8/i16/i32/64 load to XMM/YMM.

RKSimon accepted this revision.

LGTM - cheers

llvm/test/CodeGen/X86/slow-pmulld.ll
284	It's one of the many annoyances of lowering constant broadcasts that I mentioned on https://github.com/llvm/llvm-project/issues/54743 - I think this is because AVX512 doesn't have many ops that do i8/i16 broadcast-memory folds? Let's accept it for now.

This revision is now accepted and ready to land. Apr 6 2022, 3:40 AM

lebedev.ri added inline comments. Apr 6 2022, 3:45 AM

llvm/test/CodeGen/X86/slow-pmulld.ll
284	I have a patch, but it shows a number of load folding failures instead :S
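The broadcast question in this thread can be framed as: does the 16-lane i16 constant `<18778,u,18778,u,...>` repeat with a period of one i16 lane, or of two (one i32)? The hypothetical `findSplat` below treats undef lanes (`std::nullopt`) as wildcards and fills any remaining undef sub-lanes with zero. It only illustrates the reasoning, not how `lowerBuildVectorAsBroadcast()` works: under these assumptions the new constant is both an i16 splat (a `vpbroadcastw` candidate on AVX2) and an i32 splat of 18778, whereas the old `[18778,0,...]` constant is only an i32 splat, which matches the previous `vpbroadcastd`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// An i16 lane of a constant vector; std::nullopt models an undef lane.
using Lane16 = std::optional<uint16_t>;

// Hypothetical check: can `lanes` be produced by broadcasting one element of
// Group * 16 bits (Group = 1: i16, 2: i32, 4: i64)? Undef lanes match
// anything. Returns the little-endian splat value, undef sub-lanes set to 0.
std::optional<uint64_t> findSplat(const std::vector<Lane16> &lanes,
                                  unsigned Group) {
  if (Group == 0 || Group > 4 || lanes.size() % Group != 0)
    return std::nullopt;
  std::vector<Lane16> pattern(Group); // starts as all-undef
  for (std::size_t i = 0; i < lanes.size(); ++i) {
    if (!lanes[i])
      continue; // undef: compatible with any value
    Lane16 &p = pattern[i % Group];
    if (p && *p != *lanes[i])
      return std::nullopt; // two different defined values in one position
    p = lanes[i];
  }
  uint64_t value = 0;
  for (unsigned k = 0; k < Group; ++k)
    value |= uint64_t(pattern[k].value_or(0)) << (16 * k);
  return value;
}

// Builds 16 lanes <18778, X, 18778, X, ...> where X is `odd`.
std::vector<Lane16> alternating(Lane16 odd) {
  std::vector<Lane16> v;
  for (int i = 0; i < 8; ++i) {
    v.push_back(uint16_t(18778));
    v.push_back(odd);
  }
  return v;
}
```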

In D123163#3432268, @RKSimon wrote:

LGTM - cheers

Okay then, thank you for the review!

This revision was landed with ongoing or failed builds. Apr 6 2022, 4:19 AM

Closed by commit rG34ce9fd864b5: [TLI] `TargetLowering::SimplifyDemandedVectorElts()`: narrowing bitcast: fill… (authored by lebedev.ri).

This revision was automatically updated to reflect the committed changes.

lebedev.ri added a commit: rG34ce9fd864b5: [TLI] `TargetLowering::SimplifyDemandedVectorElts()`: narrowing bitcast: fill….

Revision Contents

Path                                                Size
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp    15 lines
llvm/test/CodeGen/X86/madd.ll                       3 lines
llvm/test/CodeGen/X86/shrink_vmul.ll                8 lines
llvm/test/CodeGen/X86/slow-pmulld.ll                36 lines

Diff 420780

llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp


[2,745 lines not shown; context: if ((NumElts % NumSrcElts) == 0) {]
                                      TLO, Depth + 1))
         return true;
 
       // Try calling SimplifyDemandedBits, converting demanded elts to the bits
       // of the large element.
       // TODO - bigendian once we have test coverage.
       if (IsLE) {
         unsigned SrcEltSizeInBits = SrcVT.getScalarSizeInBits();
         APInt SrcDemandedBits = APInt::getZero(SrcEltSizeInBits);
         for (unsigned i = 0; i != NumElts; ++i)
           if (DemandedElts[i]) {
             unsigned Ofs = (i % Scale) * EltSizeInBits;
             SrcDemandedBits.setBits(Ofs, Ofs + EltSizeInBits);
           }
 
         KnownBits Known;
         if (SimplifyDemandedBits(Src, SrcDemandedBits, SrcDemandedElts, Known,
                                  TLO, Depth + 1))
           return true;
+
+        // The bitcast has split each wide element into a number of
+        // narrow subelements. We have just computed the Known bits
+        // for wide elements. See if element splitting results in
+        // some subelements being zero. Only for demanded elements!
+        for (unsigned SubElt = 0; SubElt != Scale; ++SubElt) {
+          if (!Known.Zero.extractBits(EltSizeInBits, SubElt * EltSizeInBits)
+                   .isAllOnes())
+            continue;
+          for (unsigned SrcElt = 0; SrcElt != NumSrcElts; ++SrcElt) {
+            unsigned Elt = Scale * SrcElt + SubElt;
+            if (DemandedElts[Elt])
+              KnownZero.setBit(Elt);
+          }
+        }
       }
 
       // If the src element is zero/undef then all the output elements will be -
       // only demanded elements are guaranteed to be correct.
       for (unsigned i = 0; i != NumSrcElts; ++i) {
         if (SrcDemandedElts[i]) {
           if (SrcZero[i])
             KnownZero.setBits(i * Scale, (i + 1) * Scale);
[6,545 lines not shown]
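The added loop can be modelled outside LLVM as follows. This is a sketch, not the patch itself: `uint64_t` masks stand in for `APInt`/`KnownBits`, `knownZeroNarrowElts` is a hypothetical name, and it assumes the little-endian layout that the surrounding `if (IsLE)` guards, with `Scale * EltSizeInBits <= 64` and `NumSrcElts * Scale <= 64`. A sub-element position whose whole bit range is known zero across the wide elements is zero in each of them, so it is recorded for every demanded narrow element at that position.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical standalone model of the loop added in this patch.
// uint64_t masks replace APInt/KnownBits; little-endian lane order;
// assumes Scale * EltSizeInBits <= 64 and NumSrcElts * Scale <= 64.
//   KnownZeroSrcBits - bits known zero in the wide elements (Known.Zero)
//   DemandedElts     - demanded narrow elements
// Returns the narrow elements that can be marked in KnownZero.
uint64_t knownZeroNarrowElts(uint64_t KnownZeroSrcBits, uint64_t DemandedElts,
                             unsigned NumSrcElts, unsigned Scale,
                             unsigned EltSizeInBits) {
  const uint64_t SubMask =
      EltSizeInBits >= 64 ? ~0ull : ((1ull << EltSizeInBits) - 1);
  uint64_t KnownZero = 0;
  for (unsigned SubElt = 0; SubElt != Scale; ++SubElt) {
    // Mirrors Known.Zero.extractBits(EltSizeInBits, SubElt * EltSizeInBits)
    //             .isAllOnes()
    uint64_t Sub = (KnownZeroSrcBits >> (SubElt * EltSizeInBits)) & SubMask;
    if (Sub != SubMask)
      continue;
    // This sub-element position is zero in every wide element.
    for (unsigned SrcElt = 0; SrcElt != NumSrcElts; ++SrcElt) {
      unsigned Elt = Scale * SrcElt + SubElt;
      if ((DemandedElts >> Elt) & 1)
        KnownZero |= 1ull << Elt;
    }
  }
  return KnownZero;
}
```

For the summary's `<2 x i16>` to `<4 x i8>` case, `knownZeroNarrowElts(0xFF00, 0b1111, 2, 2, 8)` gives `0b1010`. For a `<4 x i32>` zero-extended from `i8` and viewed as `<8 x i16>`, a case resembling the slow-pmulld.ll tests, `knownZeroNarrowElts(0xFFFFFF00, 0xFF, 4, 2, 16)` gives `0xAA`, i.e. every odd i16 lane.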

llvm/test/CodeGen/X86/madd.ll

[2,064 lines not shown]
	; SSE2-NEXT: shufps {{.*#+}} xmm1 = xmm1[0,2],xmm0[1,3]			; SSE2-NEXT: shufps {{.*#+}} xmm1 = xmm1[0,2],xmm0[1,3]
	; SSE2-NEXT: paddd %xmm2, %xmm1			; SSE2-NEXT: paddd %xmm2, %xmm1
	; SSE2-NEXT: movdqa %xmm1, %xmm0			; SSE2-NEXT: movdqa %xmm1, %xmm0
	; SSE2-NEXT: retq			; SSE2-NEXT: retq
	;			;
	; AVX1-LABEL: pmaddwd_negative2:			; AVX1-LABEL: pmaddwd_negative2:
	; AVX1: # %bb.0:			; AVX1: # %bb.0:
	; AVX1-NEXT: vpmovsxwd %xmm0, %xmm1			; AVX1-NEXT: vpmovsxwd %xmm0, %xmm1
	; AVX1-NEXT: vpxor %xmm2, %xmm2, %xmm2			; AVX1-NEXT: vpunpckhwd {{.*#+}} xmm0 = xmm0[4,4,5,5,6,6,7,7]
	; AVX1-NEXT: vpunpckhwd {{.*#+}} xmm0 = xmm0[4],xmm2[4],xmm0[5],xmm2[5],xmm0[6],xmm2[6],xmm0[7],xmm2[7]
	; AVX1-NEXT: vpmaddwd {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0			; AVX1-NEXT: vpmaddwd {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
	; AVX1-NEXT: vpmulld {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1, %xmm1			; AVX1-NEXT: vpmulld {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1, %xmm1
	; AVX1-NEXT: vphaddd %xmm0, %xmm1, %xmm0			; AVX1-NEXT: vphaddd %xmm0, %xmm1, %xmm0
	; AVX1-NEXT: retq			; AVX1-NEXT: retq
	;			;
	; AVX256-LABEL: pmaddwd_negative2:			; AVX256-LABEL: pmaddwd_negative2:
	; AVX256: # %bb.0:			; AVX256: # %bb.0:
	; AVX256-NEXT: vpmovsxwd %xmm0, %ymm0			; AVX256-NEXT: vpmovsxwd %xmm0, %ymm0
[1,032 lines not shown]

llvm/test/CodeGen/X86/shrink_vmul.ll

[1,073 lines not shown]
	; X86-SSE-NEXT: pushl %esi			; X86-SSE-NEXT: pushl %esi
	; X86-SSE-NEXT: movl {{[0-9]+}}(%esp), %eax			; X86-SSE-NEXT: movl {{[0-9]+}}(%esp), %eax
	; X86-SSE-NEXT: movl {{[0-9]+}}(%esp), %ecx			; X86-SSE-NEXT: movl {{[0-9]+}}(%esp), %ecx
	; X86-SSE-NEXT: movl {{[0-9]+}}(%esp), %edx			; X86-SSE-NEXT: movl {{[0-9]+}}(%esp), %edx
	; X86-SSE-NEXT: movl c, %esi			; X86-SSE-NEXT: movl c, %esi
	; X86-SSE-NEXT: movd {{.*#+}} xmm0 = mem[0],zero,zero,zero			; X86-SSE-NEXT: movd {{.*#+}} xmm0 = mem[0],zero,zero,zero
	; X86-SSE-NEXT: movd {{.*#+}} xmm1 = mem[0],zero,zero,zero			; X86-SSE-NEXT: movd {{.*#+}} xmm1 = mem[0],zero,zero,zero
	; X86-SSE-NEXT: pxor %xmm2, %xmm2			; X86-SSE-NEXT: pxor %xmm2, %xmm2
	; X86-SSE-NEXT: punpcklwd {{.*#+}} xmm1 = xmm1[0],xmm2[0],xmm1[1],xmm2[1],xmm1[2],xmm2[2],xmm1[3],xmm2[3]			; X86-SSE-NEXT: punpcklwd {{.*#+}} xmm0 = xmm0[0],xmm2[0],xmm0[1],xmm2[1],xmm0[2],xmm2[2],xmm0[3],xmm2[3]
	; X86-SSE-NEXT: pshuflw {{.*#+}} xmm0 = xmm0[0,1,1,3,4,5,6,7]			; X86-SSE-NEXT: pshuflw {{.*#+}} xmm1 = xmm1[0,1,1,3,4,5,6,7]
	; X86-SSE-NEXT: pmaddwd %xmm1, %xmm0			; X86-SSE-NEXT: pmaddwd %xmm0, %xmm1
	; X86-SSE-NEXT: movq %xmm0, (%esi,%ecx,4)			; X86-SSE-NEXT: movq %xmm1, (%esi,%ecx,4)
	; X86-SSE-NEXT: popl %esi			; X86-SSE-NEXT: popl %esi
	; X86-SSE-NEXT: retl			; X86-SSE-NEXT: retl
	;			;
	; X86-AVX-LABEL: mul_2xi16_sext:			; X86-AVX-LABEL: mul_2xi16_sext:
	; X86-AVX: # %bb.0: # %entry			; X86-AVX: # %bb.0: # %entry
	; X86-AVX-NEXT: pushl %esi			; X86-AVX-NEXT: pushl %esi
	; X86-AVX-NEXT: movl {{[0-9]+}}(%esp), %eax			; X86-AVX-NEXT: movl {{[0-9]+}}(%esp), %eax
	; X86-AVX-NEXT: movl {{[0-9]+}}(%esp), %ecx			; X86-AVX-NEXT: movl {{[0-9]+}}(%esp), %ecx
[1,382 lines not shown]

llvm/test/CodeGen/X86/slow-pmulld.ll

[107 lines not shown; context: ; KNL-64-NEXT: retq]
%z = zext <4 x i8> %A to <4 x i32>		%z = zext <4 x i8> %A to <4 x i32>
%m = mul nuw nsw <4 x i32> %z, <i32 18778, i32 18778, i32 18778, i32 18778>		%m = mul nuw nsw <4 x i32> %z, <i32 18778, i32 18778, i32 18778, i32 18778>
ret <4 x i32> %m		ret <4 x i32> %m
}		}

define <8 x i32> @test_mul_v8i32_v8i8(<8 x i8> %A) {		define <8 x i32> @test_mul_v8i32_v8i8(<8 x i8> %A) {
; SLM-LABEL: test_mul_v8i32_v8i8:		; SLM-LABEL: test_mul_v8i32_v8i8:
; SLM: # %bb.0:		; SLM: # %bb.0:
; SLM-NEXT: movdqa {{.*#+}} xmm2 = [18778,0,18778,0,18778,0,18778,0]		; SLM-NEXT: movdqa {{.*#+}} xmm2 = <18778,u,18778,u,18778,u,18778,u>
; SLM-NEXT: pshufd {{.*#+}} xmm1 = xmm0[1,1,1,1]		; SLM-NEXT: pshufd {{.*#+}} xmm1 = xmm0[1,1,1,1]
; SLM-NEXT: pmovzxbd {{.*#+}} xmm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero		; SLM-NEXT: pmovzxbd {{.*#+}} xmm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero
; SLM-NEXT: pmovzxbd {{.*#+}} xmm1 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero		; SLM-NEXT: pmovzxbd {{.*#+}} xmm1 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero
; SLM-NEXT: pmaddwd %xmm2, %xmm0		; SLM-NEXT: pmaddwd %xmm2, %xmm0
; SLM-NEXT: pmaddwd %xmm2, %xmm1		; SLM-NEXT: pmaddwd %xmm2, %xmm1
; SLM-NEXT: ret{{[l\|q]}}		; SLM-NEXT: ret{{[l\|q]}}
;		;
; SLOW-LABEL: test_mul_v8i32_v8i8:		; SLOW-LABEL: test_mul_v8i32_v8i8:
; SLOW: # %bb.0:		; SLOW: # %bb.0:
; SLOW-NEXT: pshufd {{.*#+}} xmm1 = xmm0[1,1,1,1]		; SLOW-NEXT: pshufd {{.*#+}} xmm1 = xmm0[1,1,1,1]
; SLOW-NEXT: pmovzxbd {{.*#+}} xmm1 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero		; SLOW-NEXT: pmovzxbd {{.*#+}} xmm1 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero
; SLOW-NEXT: pmovzxbd {{.*#+}} xmm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero		; SLOW-NEXT: pmovzxbd {{.*#+}} xmm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero
; SLOW-NEXT: movdqa {{.*#+}} xmm2 = [18778,0,18778,0,18778,0,18778,0]		; SLOW-NEXT: movdqa {{.*#+}} xmm2 = <18778,u,18778,u,18778,u,18778,u>
; SLOW-NEXT: pmaddwd %xmm2, %xmm0		; SLOW-NEXT: pmaddwd %xmm2, %xmm0
; SLOW-NEXT: pmaddwd %xmm2, %xmm1		; SLOW-NEXT: pmaddwd %xmm2, %xmm1
; SLOW-NEXT: ret{{[l\|q]}}		; SLOW-NEXT: ret{{[l\|q]}}
;		;
; SSE4-LABEL: test_mul_v8i32_v8i8:		; SSE4-LABEL: test_mul_v8i32_v8i8:
; SSE4: # %bb.0:		; SSE4: # %bb.0:
; SSE4-NEXT: pshufd {{.*#+}} xmm1 = xmm0[1,1,1,1]		; SSE4-NEXT: pshufd {{.*#+}} xmm1 = xmm0[1,1,1,1]
; SSE4-NEXT: pmovzxbd {{.*#+}} xmm1 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero		; SSE4-NEXT: pmovzxbd {{.*#+}} xmm1 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero
; SSE4-NEXT: pmovzxbd {{.*#+}} xmm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero		; SSE4-NEXT: pmovzxbd {{.*#+}} xmm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero
; SSE4-NEXT: movdqa {{.*#+}} xmm2 = [18778,0,18778,0,18778,0,18778,0]		; SSE4-NEXT: movdqa {{.*#+}} xmm2 = <18778,u,18778,u,18778,u,18778,u>
; SSE4-NEXT: pmaddwd %xmm2, %xmm0		; SSE4-NEXT: pmaddwd %xmm2, %xmm0
; SSE4-NEXT: pmaddwd %xmm2, %xmm1		; SSE4-NEXT: pmaddwd %xmm2, %xmm1
; SSE4-NEXT: ret{{[l\|q]}}		; SSE4-NEXT: ret{{[l\|q]}}
;		;
; AVX2-SLOW32-LABEL: test_mul_v8i32_v8i8:		; AVX2-SLOW32-LABEL: test_mul_v8i32_v8i8:
; AVX2-SLOW32: # %bb.0:		; AVX2-SLOW32: # %bb.0:
; AVX2-SLOW32-NEXT: vpmovzxbd {{.*#+}} ymm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero,xmm0[4],zero,zero,zero,xmm0[5],zero,zero,zero,xmm0[6],zero,zero,zero,xmm0[7],zero,zero,zero		; AVX2-SLOW32-NEXT: vpmovzxbd {{.*#+}} ymm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero,xmm0[4],zero,zero,zero,xmm0[5],zero,zero,zero,xmm0[6],zero,zero,zero,xmm0[7],zero,zero,zero
; AVX2-SLOW32-NEXT: vpmaddwd {{\.?LCPI[0-9]+_[0-9]+}}, %ymm0, %ymm0		; AVX2-SLOW32-NEXT: vpmaddwd {{\.?LCPI[0-9]+_[0-9]+}}, %ymm0, %ymm0
[58 lines not shown; context: ; KNL-64-NEXT: retq]
%m = mul nuw nsw <8 x i32> %z, <i32 18778, i32 18778, i32 18778, i32 18778, i32 18778, i32 18778, i32 18778, i32 18778>		%m = mul nuw nsw <8 x i32> %z, <i32 18778, i32 18778, i32 18778, i32 18778, i32 18778, i32 18778, i32 18778, i32 18778>
ret <8 x i32> %m		ret <8 x i32> %m
}		}

define <16 x i32> @test_mul_v16i32_v16i8(<16 x i8> %A) {		define <16 x i32> @test_mul_v16i32_v16i8(<16 x i8> %A) {
; SLM-LABEL: test_mul_v16i32_v16i8:		; SLM-LABEL: test_mul_v16i32_v16i8:
; SLM: # %bb.0:		; SLM: # %bb.0:
; SLM-NEXT: pshufd {{.*#+}} xmm1 = xmm0[3,3,3,3]		; SLM-NEXT: pshufd {{.*#+}} xmm1 = xmm0[3,3,3,3]
; SLM-NEXT: movdqa {{.*#+}} xmm5 = [18778,0,18778,0,18778,0,18778,0]		; SLM-NEXT: movdqa {{.*#+}} xmm5 = <18778,u,18778,u,18778,u,18778,u>
; SLM-NEXT: pshufd {{.*#+}} xmm4 = xmm0[1,1,1,1]		; SLM-NEXT: pshufd {{.*#+}} xmm4 = xmm0[1,1,1,1]
; SLM-NEXT: pmovzxbd {{.*#+}} xmm3 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero		; SLM-NEXT: pmovzxbd {{.*#+}} xmm3 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero
; SLM-NEXT: pshufd {{.*#+}} xmm1 = xmm0[2,3,2,3]		; SLM-NEXT: pshufd {{.*#+}} xmm1 = xmm0[2,3,2,3]
; SLM-NEXT: pmovzxbd {{.*#+}} xmm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero		; SLM-NEXT: pmovzxbd {{.*#+}} xmm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero
; SLM-NEXT: pmovzxbd {{.*#+}} xmm2 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero		; SLM-NEXT: pmovzxbd {{.*#+}} xmm2 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero
; SLM-NEXT: pmovzxbd {{.*#+}} xmm1 = xmm4[0],zero,zero,zero,xmm4[1],zero,zero,zero,xmm4[2],zero,zero,zero,xmm4[3],zero,zero,zero		; SLM-NEXT: pmovzxbd {{.*#+}} xmm1 = xmm4[0],zero,zero,zero,xmm4[1],zero,zero,zero,xmm4[2],zero,zero,zero,xmm4[3],zero,zero,zero
; SLM-NEXT: pmaddwd %xmm5, %xmm0		; SLM-NEXT: pmaddwd %xmm5, %xmm0
; SLM-NEXT: pmaddwd %xmm5, %xmm1		; SLM-NEXT: pmaddwd %xmm5, %xmm1
; SLM-NEXT: pmaddwd %xmm5, %xmm2		; SLM-NEXT: pmaddwd %xmm5, %xmm2
; SLM-NEXT: pmaddwd %xmm5, %xmm3		; SLM-NEXT: pmaddwd %xmm5, %xmm3
; SLM-NEXT: ret{{[l\|q]}}		; SLM-NEXT: ret{{[l\|q]}}
;		;
; SLOW-LABEL: test_mul_v16i32_v16i8:		; SLOW-LABEL: test_mul_v16i32_v16i8:
; SLOW: # %bb.0:		; SLOW: # %bb.0:
; SLOW-NEXT: pshufd {{.*#+}} xmm1 = xmm0[3,3,3,3]		; SLOW-NEXT: pshufd {{.*#+}} xmm1 = xmm0[3,3,3,3]
; SLOW-NEXT: pmovzxbd {{.*#+}} xmm3 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero		; SLOW-NEXT: pmovzxbd {{.*#+}} xmm3 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero
; SLOW-NEXT: pshufd {{.*#+}} xmm1 = xmm0[2,3,2,3]		; SLOW-NEXT: pshufd {{.*#+}} xmm1 = xmm0[2,3,2,3]
; SLOW-NEXT: pmovzxbd {{.*#+}} xmm2 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero		; SLOW-NEXT: pmovzxbd {{.*#+}} xmm2 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero
; SLOW-NEXT: pshufd {{.*#+}} xmm1 = xmm0[1,1,1,1]		; SLOW-NEXT: pshufd {{.*#+}} xmm1 = xmm0[1,1,1,1]
; SLOW-NEXT: pmovzxbd {{.*#+}} xmm1 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero		; SLOW-NEXT: pmovzxbd {{.*#+}} xmm1 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero
; SLOW-NEXT: pmovzxbd {{.*#+}} xmm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero		; SLOW-NEXT: pmovzxbd {{.*#+}} xmm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero
; SLOW-NEXT: movdqa {{.*#+}} xmm4 = [18778,0,18778,0,18778,0,18778,0]		; SLOW-NEXT: movdqa {{.*#+}} xmm4 = <18778,u,18778,u,18778,u,18778,u>
; SLOW-NEXT: pmaddwd %xmm4, %xmm0		; SLOW-NEXT: pmaddwd %xmm4, %xmm0
; SLOW-NEXT: pmaddwd %xmm4, %xmm1		; SLOW-NEXT: pmaddwd %xmm4, %xmm1
; SLOW-NEXT: pmaddwd %xmm4, %xmm2		; SLOW-NEXT: pmaddwd %xmm4, %xmm2
; SLOW-NEXT: pmaddwd %xmm4, %xmm3		; SLOW-NEXT: pmaddwd %xmm4, %xmm3
; SLOW-NEXT: ret{{[l\|q]}}		; SLOW-NEXT: ret{{[l\|q]}}
;		;
; SSE4-LABEL: test_mul_v16i32_v16i8:		; SSE4-LABEL: test_mul_v16i32_v16i8:
; SSE4: # %bb.0:		; SSE4: # %bb.0:
; SSE4-NEXT: pshufd {{.*#+}} xmm1 = xmm0[3,3,3,3]		; SSE4-NEXT: pshufd {{.*#+}} xmm1 = xmm0[3,3,3,3]
; SSE4-NEXT: pmovzxbd {{.*#+}} xmm3 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero		; SSE4-NEXT: pmovzxbd {{.*#+}} xmm3 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero
; SSE4-NEXT: pshufd {{.*#+}} xmm1 = xmm0[2,3,2,3]		; SSE4-NEXT: pshufd {{.*#+}} xmm1 = xmm0[2,3,2,3]
; SSE4-NEXT: pmovzxbd {{.*#+}} xmm2 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero		; SSE4-NEXT: pmovzxbd {{.*#+}} xmm2 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero
; SSE4-NEXT: pshufd {{.*#+}} xmm1 = xmm0[1,1,1,1]		; SSE4-NEXT: pshufd {{.*#+}} xmm1 = xmm0[1,1,1,1]
; SSE4-NEXT: pmovzxbd {{.*#+}} xmm1 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero		; SSE4-NEXT: pmovzxbd {{.*#+}} xmm1 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero
; SSE4-NEXT: pmovzxbd {{.*#+}} xmm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero		; SSE4-NEXT: pmovzxbd {{.*#+}} xmm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero
; SSE4-NEXT: movdqa {{.*#+}} xmm4 = [18778,0,18778,0,18778,0,18778,0]		; SSE4-NEXT: movdqa {{.*#+}} xmm4 = <18778,u,18778,u,18778,u,18778,u>
; SSE4-NEXT: pmaddwd %xmm4, %xmm0		; SSE4-NEXT: pmaddwd %xmm4, %xmm0
; SSE4-NEXT: pmaddwd %xmm4, %xmm1		; SSE4-NEXT: pmaddwd %xmm4, %xmm1
; SSE4-NEXT: pmaddwd %xmm4, %xmm2		; SSE4-NEXT: pmaddwd %xmm4, %xmm2
; SSE4-NEXT: pmaddwd %xmm4, %xmm3		; SSE4-NEXT: pmaddwd %xmm4, %xmm3
; SSE4-NEXT: ret{{[l\|q]}}		; SSE4-NEXT: ret{{[l\|q]}}
;		;
; AVX2-SLOW-LABEL: test_mul_v16i32_v16i8:		; AVX2-SLOW-LABEL: test_mul_v16i32_v16i8:
; AVX2-SLOW: # %bb.0:		; AVX2-SLOW: # %bb.0:
; AVX2-SLOW-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,2,3]		; AVX2-SLOW-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,2,3]
; AVX2-SLOW-NEXT: vpmovzxbd {{.*#+}} ymm1 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero,xmm1[4],zero,zero,zero,xmm1[5],zero,zero,zero,xmm1[6],zero,zero,zero,xmm1[7],zero,zero,zero		; AVX2-SLOW-NEXT: vpmovzxbd {{.*#+}} ymm1 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero,xmm1[4],zero,zero,zero,xmm1[5],zero,zero,zero,xmm1[6],zero,zero,zero,xmm1[7],zero,zero,zero
; AVX2-SLOW-NEXT: vpmovzxbd {{.*#+}} ymm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero,xmm0[4],zero,zero,zero,xmm0[5],zero,zero,zero,xmm0[6],zero,zero,zero,xmm0[7],zero,zero,zero		; AVX2-SLOW-NEXT: vpmovzxbd {{.*#+}} ymm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero,xmm0[4],zero,zero,zero,xmm0[5],zero,zero,zero,xmm0[6],zero,zero,zero,xmm0[7],zero,zero,zero
; AVX2-SLOW-NEXT: vpbroadcastd {{.*#+}} ymm2 = [18778,18778,18778,18778,18778,18778,18778,18778]		; AVX2-SLOW-NEXT: vmovdqa {{.*#+}} ymm2 = <18778,u,18778,u,18778,u,18778,u,18778,u,18778,u,18778,u,18778,u>
; AVX2-SLOW-NEXT: vpmaddwd %ymm2, %ymm0, %ymm0		; AVX2-SLOW-NEXT: vpmaddwd %ymm2, %ymm0, %ymm0
; AVX2-SLOW-NEXT: vpmaddwd %ymm2, %ymm1, %ymm1		; AVX2-SLOW-NEXT: vpmaddwd %ymm2, %ymm1, %ymm1
; AVX2-SLOW-NEXT: ret{{[l\|q]}}		; AVX2-SLOW-NEXT: ret{{[l\|q]}}
;		;
; AVX2-32-LABEL: test_mul_v16i32_v16i8:		; AVX2-32-LABEL: test_mul_v16i32_v16i8:
; AVX2-32: # %bb.0:		; AVX2-32: # %bb.0:
; AVX2-32-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,2,3]		; AVX2-32-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,2,3]
; AVX2-32-NEXT: vpmovzxbd {{.*#+}} ymm1 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero,xmm1[4],zero,zero,zero,xmm1[5],zero,zero,zero,xmm1[6],zero,zero,zero,xmm1[7],zero,zero,zero		; AVX2-32-NEXT: vpmovzxbd {{.*#+}} ymm1 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero,xmm1[4],zero,zero,zero,xmm1[5],zero,zero,zero,xmm1[6],zero,zero,zero,xmm1[7],zero,zero,zero
; AVX2-32-NEXT: vpmovzxbd {{.*#+}} ymm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero,xmm0[4],zero,zero,zero,xmm0[5],zero,zero,zero,xmm0[6],zero,zero,zero,xmm0[7],zero,zero,zero		; AVX2-32-NEXT: vpmovzxbd {{.*#+}} ymm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero,xmm0[4],zero,zero,zero,xmm0[5],zero,zero,zero,xmm0[6],zero,zero,zero,xmm0[7],zero,zero,zero
; AVX2-32-NEXT: vpbroadcastd {{.*#+}} ymm2 = [18778,18778,18778,18778,18778,18778,18778,18778]		; AVX2-32-NEXT: vmovdqa {{.*#+}} ymm2 = <18778,u,18778,u,18778,u,18778,u,18778,u,18778,u,18778,u,18778,u>
; AVX2-32-NEXT: vpmaddwd %ymm2, %ymm0, %ymm0		; AVX2-32-NEXT: vpmaddwd %ymm2, %ymm0, %ymm0
; AVX2-32-NEXT: vpmaddwd %ymm2, %ymm1, %ymm1		; AVX2-32-NEXT: vpmaddwd %ymm2, %ymm1, %ymm1
; AVX2-32-NEXT: retl		; AVX2-32-NEXT: retl
;		;
; AVX2-64-LABEL: test_mul_v16i32_v16i8:		; AVX2-64-LABEL: test_mul_v16i32_v16i8:
; AVX2-64: # %bb.0:		; AVX2-64: # %bb.0:
; AVX2-64-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,2,3]		; AVX2-64-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,2,3]
; AVX2-64-NEXT: vpmovzxbd {{.*#+}} ymm1 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero,xmm1[4],zero,zero,zero,xmm1[5],zero,zero,zero,xmm1[6],zero,zero,zero,xmm1[7],zero,zero,zero		; AVX2-64-NEXT: vpmovzxbd {{.*#+}} ymm1 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero,xmm1[4],zero,zero,zero,xmm1[5],zero,zero,zero,xmm1[6],zero,zero,zero,xmm1[7],zero,zero,zero
; AVX2-64-NEXT: vpmovzxbd {{.*#+}} ymm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero,xmm0[4],zero,zero,zero,xmm0[5],zero,zero,zero,xmm0[6],zero,zero,zero,xmm0[7],zero,zero,zero		; AVX2-64-NEXT: vpmovzxbd {{.*#+}} ymm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero,xmm0[4],zero,zero,zero,xmm0[5],zero,zero,zero,xmm0[6],zero,zero,zero,xmm0[7],zero,zero,zero
; AVX2-64-NEXT: vpbroadcastd {{.*#+}} ymm2 = [18778,18778,18778,18778,18778,18778,18778,18778]		; AVX2-64-NEXT: vmovdqa {{.*#+}} ymm2 = <18778,u,18778,u,18778,u,18778,u,18778,u,18778,u,18778,u,18778,u>
; AVX2-64-NEXT: vpmaddwd %ymm2, %ymm0, %ymm0		; AVX2-64-NEXT: vpmaddwd %ymm2, %ymm0, %ymm0
; AVX2-64-NEXT: vpmaddwd %ymm2, %ymm1, %ymm1		; AVX2-64-NEXT: vpmaddwd %ymm2, %ymm1, %ymm1
; AVX2-64-NEXT: retq		; AVX2-64-NEXT: retq
;		;
; AVX512DQ-32-LABEL: test_mul_v16i32_v16i8:		; AVX512DQ-32-LABEL: test_mul_v16i32_v16i8:
; AVX512DQ-32: # %bb.0:		; AVX512DQ-32: # %bb.0:
; AVX512DQ-32-NEXT: vpmovzxbd {{.*#+}} zmm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero,xmm0[4],zero,zero,zero,xmm0[5],zero,zero,zero,xmm0[6],zero,zero,zero,xmm0[7],zero,zero,zero,xmm0[8],zero,zero,zero,xmm0[9],zero,zero,zero,xmm0[10],zero,zero,zero,xmm0[11],zero,zero,zero,xmm0[12],zero,zero,zero,xmm0[13],zero,zero,zero,xmm0[14],zero,zero,zero,xmm0[15],zero,zero,zero		; AVX512DQ-32-NEXT: vpmovzxbd {{.*#+}} zmm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero,xmm0[4],zero,zero,zero,xmm0[5],zero,zero,zero,xmm0[6],zero,zero,zero,xmm0[7],zero,zero,zero,xmm0[8],zero,zero,zero,xmm0[9],zero,zero,zero,xmm0[10],zero,zero,zero,xmm0[11],zero,zero,zero,xmm0[12],zero,zero,zero,xmm0[13],zero,zero,zero,xmm0[14],zero,zero,zero,xmm0[15],zero,zero,zero
; AVX512DQ-32-NEXT: vpmulld {{\.?LCPI[0-9]+_[0-9]+}}{1to16}, %zmm0, %zmm0		; AVX512DQ-32-NEXT: vpmulld {{\.?LCPI[0-9]+_[0-9]+}}{1to16}, %zmm0, %zmm0
[351 lines not shown; context: ; KNL-64-NEXT: retq]
%z = zext <4 x i8> %A to <4 x i32>		%z = zext <4 x i8> %A to <4 x i32>
%m = mul nuw nsw <4 x i32> %z, <i32 18778, i32 18778, i32 18778, i32 18778>		%m = mul nuw nsw <4 x i32> %z, <i32 18778, i32 18778, i32 18778, i32 18778>
ret <4 x i32> %m		ret <4 x i32> %m
}		}

define <8 x i32> @test_mul_v8i32_v8i8_minsize(<8 x i8> %A) minsize {		define <8 x i32> @test_mul_v8i32_v8i8_minsize(<8 x i8> %A) minsize {
; SLM-LABEL: test_mul_v8i32_v8i8_minsize:		; SLM-LABEL: test_mul_v8i32_v8i8_minsize:
; SLM: # %bb.0:		; SLM: # %bb.0:
; SLM-NEXT: movdqa {{.*#+}} xmm2 = [18778,0,18778,0,18778,0,18778,0]		; SLM-NEXT: movdqa {{.*#+}} xmm2 = <18778,u,18778,u,18778,u,18778,u>
; SLM-NEXT: pshufd {{.*#+}} xmm1 = xmm0[1,1,1,1]		; SLM-NEXT: pshufd {{.*#+}} xmm1 = xmm0[1,1,1,1]
; SLM-NEXT: pmovzxbd {{.*#+}} xmm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero		; SLM-NEXT: pmovzxbd {{.*#+}} xmm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero
; SLM-NEXT: pmovzxbd {{.*#+}} xmm1 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero		; SLM-NEXT: pmovzxbd {{.*#+}} xmm1 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero
; SLM-NEXT: pmaddwd %xmm2, %xmm0		; SLM-NEXT: pmaddwd %xmm2, %xmm0
; SLM-NEXT: pmaddwd %xmm2, %xmm1		; SLM-NEXT: pmaddwd %xmm2, %xmm1
; SLM-NEXT: ret{{[l\|q]}}		; SLM-NEXT: ret{{[l\|q]}}
;		;
; SLOW-LABEL: test_mul_v8i32_v8i8_minsize:		; SLOW-LABEL: test_mul_v8i32_v8i8_minsize:
; SLOW: # %bb.0:		; SLOW: # %bb.0:
; SLOW-NEXT: pshufd {{.*#+}} xmm1 = xmm0[1,1,1,1]		; SLOW-NEXT: pshufd {{.*#+}} xmm1 = xmm0[1,1,1,1]
; SLOW-NEXT: pmovzxbd {{.*#+}} xmm1 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero		; SLOW-NEXT: pmovzxbd {{.*#+}} xmm1 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero
; SLOW-NEXT: pmovzxbd {{.*#+}} xmm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero		; SLOW-NEXT: pmovzxbd {{.*#+}} xmm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero
; SLOW-NEXT: movdqa {{.*#+}} xmm2 = [18778,0,18778,0,18778,0,18778,0]		; SLOW-NEXT: movdqa {{.*#+}} xmm2 = <18778,u,18778,u,18778,u,18778,u>
; SLOW-NEXT: pmaddwd %xmm2, %xmm0		; SLOW-NEXT: pmaddwd %xmm2, %xmm0
; SLOW-NEXT: pmaddwd %xmm2, %xmm1		; SLOW-NEXT: pmaddwd %xmm2, %xmm1
; SLOW-NEXT: ret{{[l\|q]}}		; SLOW-NEXT: ret{{[l\|q]}}
;		;
; SSE4-LABEL: test_mul_v8i32_v8i8_minsize:		; SSE4-LABEL: test_mul_v8i32_v8i8_minsize:
; SSE4: # %bb.0:		; SSE4: # %bb.0:
; SSE4-NEXT: pshufd {{.*#+}} xmm1 = xmm0[1,1,1,1]		; SSE4-NEXT: pshufd {{.*#+}} xmm1 = xmm0[1,1,1,1]
; SSE4-NEXT: pmovzxbd {{.*#+}} xmm1 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero		; SSE4-NEXT: pmovzxbd {{.*#+}} xmm1 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero
; SSE4-NEXT: pmovzxbd {{.*#+}} xmm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero		; SSE4-NEXT: pmovzxbd {{.*#+}} xmm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero
; SSE4-NEXT: movdqa {{.*#+}} xmm2 = [18778,0,18778,0,18778,0,18778,0]		; SSE4-NEXT: movdqa {{.*#+}} xmm2 = <18778,u,18778,u,18778,u,18778,u>
; SSE4-NEXT: pmaddwd %xmm2, %xmm0		; SSE4-NEXT: pmaddwd %xmm2, %xmm0
; SSE4-NEXT: pmaddwd %xmm2, %xmm1		; SSE4-NEXT: pmaddwd %xmm2, %xmm1
; SSE4-NEXT: ret{{[l\|q]}}		; SSE4-NEXT: ret{{[l\|q]}}
;		;
; AVX2-SLOW32-LABEL: test_mul_v8i32_v8i8_minsize:		; AVX2-SLOW32-LABEL: test_mul_v8i32_v8i8_minsize:
; AVX2-SLOW32: # %bb.0:		; AVX2-SLOW32: # %bb.0:
; AVX2-SLOW32-NEXT: vpmovzxbd {{.*#+}} ymm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero,xmm0[4],zero,zero,zero,xmm0[5],zero,zero,zero,xmm0[6],zero,zero,zero,xmm0[7],zero,zero,zero		; AVX2-SLOW32-NEXT: vpmovzxbd {{.*#+}} ymm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero,xmm0[4],zero,zero,zero,xmm0[5],zero,zero,zero,xmm0[6],zero,zero,zero,xmm0[7],zero,zero,zero
; AVX2-SLOW32-NEXT: vpmaddwd {{\.?LCPI[0-9]+_[0-9]+}}, %ymm0, %ymm0		; AVX2-SLOW32-NEXT: vpmaddwd {{\.?LCPI[0-9]+_[0-9]+}}, %ymm0, %ymm0
; KNL-64-NEXT: retq
%m = mul nuw nsw <8 x i32> %z, <i32 18778, i32 18778, i32 18778, i32 18778, i32 18778, i32 18778, i32 18778, i32 18778>		%m = mul nuw nsw <8 x i32> %z, <i32 18778, i32 18778, i32 18778, i32 18778, i32 18778, i32 18778, i32 18778, i32 18778>
ret <8 x i32> %m		ret <8 x i32> %m
}		}
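In the SLOW and SSE4 checks above, the pmaddwd constant changes from [18778,0,...] to <18778,u,...>. This follows from the fold under review. Each pmovzxbd result is a v4i32 whose upper 24 bits per lane are known zero. After the bitcast to v8i16, the odd i16 elements are therefore entirely zero, and the matching elements of the pmaddwd constant apparently stop being demanded. The sketch below shows the known-bits-to-known-zero-elements mapping under these assumptions: little-endian lane order, plain uint64_t masks instead of LLVM's APInt/KnownBits, and at most 64 narrow elements. `knownZeroNarrowElts` is a hypothetical name, not the code in TargetLowering.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (illustration only). Input: one known-zero bit mask per
// wide source element. Output: bit K is set when narrow element K of the
// bitcast result is known to be entirely zero.
// Assumes little-endian lane order, srcBits a multiple of dstBits,
// srcBits <= 64, and at most 64 narrow elements in total.
uint64_t knownZeroNarrowElts(const std::vector<uint64_t>& srcKnownZero,
                             unsigned srcBits, unsigned dstBits) {
  assert(dstBits != 0 && srcBits % dstBits == 0 && srcBits <= 64);
  unsigned scale = srcBits / dstBits;  // narrow elements per wide element
  uint64_t dstMask = dstBits == 64 ? ~0ULL : ((1ULL << dstBits) - 1);
  uint64_t result = 0;
  for (std::size_t i = 0; i < srcKnownZero.size(); ++i) {
    for (unsigned j = 0; j < scale; ++j) {
      // Bits of wide element i that land in narrow element i*scale + j.
      uint64_t slice = (srcKnownZero[i] >> (j * dstBits)) & dstMask;
      if (slice == dstMask)  // every bit of this slice is known zero
        result |= 1ULL << (i * scale + j);
    }
  }
  return result;
}
```

For the zext <2 x i8> to <2 x i16> example in the summary, masks {0xFF00, 0xFF00} with srcBits = 16 and dstBits = 8 give 0b1010. For the v4i32 pmovzxbd results here, four 0xFFFFFF00 masks narrowed from 32 to 16 bits give 0xAA, which corresponds to the odd lanes that now print as u.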

define <16 x i32> @test_mul_v16i32_v16i8_minsize(<16 x i8> %A) minsize {		define <16 x i32> @test_mul_v16i32_v16i8_minsize(<16 x i8> %A) minsize {
; SLM-LABEL: test_mul_v16i32_v16i8_minsize:		; SLM-LABEL: test_mul_v16i32_v16i8_minsize:
; SLM: # %bb.0:		; SLM: # %bb.0:
; SLM-NEXT: pshufd {{.*#+}} xmm1 = xmm0[3,3,3,3]		; SLM-NEXT: pshufd {{.*#+}} xmm1 = xmm0[3,3,3,3]
; SLM-NEXT: movdqa {{.*#+}} xmm5 = [18778,0,18778,0,18778,0,18778,0]		; SLM-NEXT: movdqa {{.*#+}} xmm5 = <18778,u,18778,u,18778,u,18778,u>
; SLM-NEXT: pshufd {{.*#+}} xmm4 = xmm0[1,1,1,1]		; SLM-NEXT: pshufd {{.*#+}} xmm4 = xmm0[1,1,1,1]
; SLM-NEXT: pmovzxbd {{.*#+}} xmm3 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero		; SLM-NEXT: pmovzxbd {{.*#+}} xmm3 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero
; SLM-NEXT: pshufd {{.*#+}} xmm1 = xmm0[2,3,2,3]		; SLM-NEXT: pshufd {{.*#+}} xmm1 = xmm0[2,3,2,3]
; SLM-NEXT: pmovzxbd {{.*#+}} xmm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero		; SLM-NEXT: pmovzxbd {{.*#+}} xmm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero
; SLM-NEXT: pmovzxbd {{.*#+}} xmm2 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero		; SLM-NEXT: pmovzxbd {{.*#+}} xmm2 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero
; SLM-NEXT: pmovzxbd {{.*#+}} xmm1 = xmm4[0],zero,zero,zero,xmm4[1],zero,zero,zero,xmm4[2],zero,zero,zero,xmm4[3],zero,zero,zero		; SLM-NEXT: pmovzxbd {{.*#+}} xmm1 = xmm4[0],zero,zero,zero,xmm4[1],zero,zero,zero,xmm4[2],zero,zero,zero,xmm4[3],zero,zero,zero
; SLM-NEXT: pmaddwd %xmm5, %xmm0		; SLM-NEXT: pmaddwd %xmm5, %xmm0
; SLM-NEXT: pmaddwd %xmm5, %xmm1		; SLM-NEXT: pmaddwd %xmm5, %xmm1
; SLM-NEXT: pmaddwd %xmm5, %xmm2		; SLM-NEXT: pmaddwd %xmm5, %xmm2
; SLM-NEXT: pmaddwd %xmm5, %xmm3		; SLM-NEXT: pmaddwd %xmm5, %xmm3
; SLM-NEXT: ret{{[l\|q]}}		; SLM-NEXT: ret{{[l\|q]}}
;		;
; SLOW-LABEL: test_mul_v16i32_v16i8_minsize:		; SLOW-LABEL: test_mul_v16i32_v16i8_minsize:
; SLOW: # %bb.0:		; SLOW: # %bb.0:
; SLOW-NEXT: pshufd {{.*#+}} xmm1 = xmm0[3,3,3,3]		; SLOW-NEXT: pshufd {{.*#+}} xmm1 = xmm0[3,3,3,3]
; SLOW-NEXT: pmovzxbd {{.*#+}} xmm3 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero		; SLOW-NEXT: pmovzxbd {{.*#+}} xmm3 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero
; SLOW-NEXT: pshufd {{.*#+}} xmm1 = xmm0[2,3,2,3]		; SLOW-NEXT: pshufd {{.*#+}} xmm1 = xmm0[2,3,2,3]
; SLOW-NEXT: pmovzxbd {{.*#+}} xmm2 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero		; SLOW-NEXT: pmovzxbd {{.*#+}} xmm2 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero
; SLOW-NEXT: pshufd {{.*#+}} xmm1 = xmm0[1,1,1,1]		; SLOW-NEXT: pshufd {{.*#+}} xmm1 = xmm0[1,1,1,1]
; SLOW-NEXT: pmovzxbd {{.*#+}} xmm1 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero		; SLOW-NEXT: pmovzxbd {{.*#+}} xmm1 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero
; SLOW-NEXT: pmovzxbd {{.*#+}} xmm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero		; SLOW-NEXT: pmovzxbd {{.*#+}} xmm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero
; SLOW-NEXT: movdqa {{.*#+}} xmm4 = [18778,0,18778,0,18778,0,18778,0]		; SLOW-NEXT: movdqa {{.*#+}} xmm4 = <18778,u,18778,u,18778,u,18778,u>
; SLOW-NEXT: pmaddwd %xmm4, %xmm0		; SLOW-NEXT: pmaddwd %xmm4, %xmm0
; SLOW-NEXT: pmaddwd %xmm4, %xmm1		; SLOW-NEXT: pmaddwd %xmm4, %xmm1
; SLOW-NEXT: pmaddwd %xmm4, %xmm2		; SLOW-NEXT: pmaddwd %xmm4, %xmm2
; SLOW-NEXT: pmaddwd %xmm4, %xmm3		; SLOW-NEXT: pmaddwd %xmm4, %xmm3
; SLOW-NEXT: ret{{[l\|q]}}		; SLOW-NEXT: ret{{[l\|q]}}
;		;
; SSE4-LABEL: test_mul_v16i32_v16i8_minsize:		; SSE4-LABEL: test_mul_v16i32_v16i8_minsize:
; SSE4: # %bb.0:		; SSE4: # %bb.0:
; SSE4-NEXT: pshufd {{.*#+}} xmm1 = xmm0[3,3,3,3]		; SSE4-NEXT: pshufd {{.*#+}} xmm1 = xmm0[3,3,3,3]
; SSE4-NEXT: pmovzxbd {{.*#+}} xmm3 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero		; SSE4-NEXT: pmovzxbd {{.*#+}} xmm3 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero
; SSE4-NEXT: pshufd {{.*#+}} xmm1 = xmm0[2,3,2,3]		; SSE4-NEXT: pshufd {{.*#+}} xmm1 = xmm0[2,3,2,3]
; SSE4-NEXT: pmovzxbd {{.*#+}} xmm2 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero		; SSE4-NEXT: pmovzxbd {{.*#+}} xmm2 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero
; SSE4-NEXT: pshufd {{.*#+}} xmm1 = xmm0[1,1,1,1]		; SSE4-NEXT: pshufd {{.*#+}} xmm1 = xmm0[1,1,1,1]
; SSE4-NEXT: pmovzxbd {{.*#+}} xmm1 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero		; SSE4-NEXT: pmovzxbd {{.*#+}} xmm1 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero
; SSE4-NEXT: pmovzxbd {{.*#+}} xmm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero		; SSE4-NEXT: pmovzxbd {{.*#+}} xmm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero
; SSE4-NEXT: movdqa {{.*#+}} xmm4 = [18778,0,18778,0,18778,0,18778,0]		; SSE4-NEXT: movdqa {{.*#+}} xmm4 = <18778,u,18778,u,18778,u,18778,u>
; SSE4-NEXT: pmaddwd %xmm4, %xmm0		; SSE4-NEXT: pmaddwd %xmm4, %xmm0
; SSE4-NEXT: pmaddwd %xmm4, %xmm1		; SSE4-NEXT: pmaddwd %xmm4, %xmm1
; SSE4-NEXT: pmaddwd %xmm4, %xmm2		; SSE4-NEXT: pmaddwd %xmm4, %xmm2
; SSE4-NEXT: pmaddwd %xmm4, %xmm3		; SSE4-NEXT: pmaddwd %xmm4, %xmm3
; SSE4-NEXT: ret{{[l\|q]}}		; SSE4-NEXT: ret{{[l\|q]}}
;		;
; AVX2-SLOW-LABEL: test_mul_v16i32_v16i8_minsize:		; AVX2-SLOW-LABEL: test_mul_v16i32_v16i8_minsize:
; AVX2-SLOW: # %bb.0:		; AVX2-SLOW: # %bb.0:
; AVX2-SLOW-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,2,3]		; AVX2-SLOW-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,2,3]
; AVX2-SLOW-NEXT: vpmovzxbd {{.*#+}} ymm1 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero,xmm1[4],zero,zero,zero,xmm1[5],zero,zero,zero,xmm1[6],zero,zero,zero,xmm1[7],zero,zero,zero		; AVX2-SLOW-NEXT: vpmovzxbd {{.*#+}} ymm1 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero,xmm1[4],zero,zero,zero,xmm1[5],zero,zero,zero,xmm1[6],zero,zero,zero,xmm1[7],zero,zero,zero
; AVX2-SLOW-NEXT: vpmovzxbd {{.*#+}} ymm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero,xmm0[4],zero,zero,zero,xmm0[5],zero,zero,zero,xmm0[6],zero,zero,zero,xmm0[7],zero,zero,zero		; AVX2-SLOW-NEXT: vpmovzxbd {{.*#+}} ymm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero,xmm0[4],zero,zero,zero,xmm0[5],zero,zero,zero,xmm0[6],zero,zero,zero,xmm0[7],zero,zero,zero
; AVX2-SLOW-NEXT: vpbroadcastd {{.*#+}} ymm2 = [18778,18778,18778,18778,18778,18778,18778,18778]		; AVX2-SLOW-NEXT: vpbroadcastw {{.*#+}} ymm2 = [18778,18778,18778,18778,18778,18778,18778,18778,18778,18778,18778,18778,18778,18778,18778,18778]
; AVX2-SLOW-NEXT: vpmaddwd %ymm2, %ymm0, %ymm0		; AVX2-SLOW-NEXT: vpmaddwd %ymm2, %ymm0, %ymm0
; AVX2-SLOW-NEXT: vpmaddwd %ymm2, %ymm1, %ymm1		; AVX2-SLOW-NEXT: vpmaddwd %ymm2, %ymm1, %ymm1
; AVX2-SLOW-NEXT: ret{{[l\|q]}}		; AVX2-SLOW-NEXT: ret{{[l\|q]}}
;		;
; AVX2-32-LABEL: test_mul_v16i32_v16i8_minsize:		; AVX2-32-LABEL: test_mul_v16i32_v16i8_minsize:
; AVX2-32: # %bb.0:		; AVX2-32: # %bb.0:
; AVX2-32-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,2,3]		; AVX2-32-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,2,3]
; AVX2-32-NEXT: vpmovzxbd {{.*#+}} ymm1 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero,xmm1[4],zero,zero,zero,xmm1[5],zero,zero,zero,xmm1[6],zero,zero,zero,xmm1[7],zero,zero,zero		; AVX2-32-NEXT: vpmovzxbd {{.*#+}} ymm1 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero,xmm1[4],zero,zero,zero,xmm1[5],zero,zero,zero,xmm1[6],zero,zero,zero,xmm1[7],zero,zero,zero
; AVX2-32-NEXT: vpmovzxbd {{.*#+}} ymm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero,xmm0[4],zero,zero,zero,xmm0[5],zero,zero,zero,xmm0[6],zero,zero,zero,xmm0[7],zero,zero,zero		; AVX2-32-NEXT: vpmovzxbd {{.*#+}} ymm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero,xmm0[4],zero,zero,zero,xmm0[5],zero,zero,zero,xmm0[6],zero,zero,zero,xmm0[7],zero,zero,zero
; AVX2-32-NEXT: vpbroadcastd {{.*#+}} ymm2 = [18778,18778,18778,18778,18778,18778,18778,18778]		; AVX2-32-NEXT: vpbroadcastw {{.*#+}} ymm2 = [18778,18778,18778,18778,18778,18778,18778,18778,18778,18778,18778,18778,18778,18778,18778,18778]
; AVX2-32-NEXT: vpmaddwd %ymm2, %ymm0, %ymm0		; AVX2-32-NEXT: vpmaddwd %ymm2, %ymm0, %ymm0
; AVX2-32-NEXT: vpmaddwd %ymm2, %ymm1, %ymm1		; AVX2-32-NEXT: vpmaddwd %ymm2, %ymm1, %ymm1
; AVX2-32-NEXT: retl		; AVX2-32-NEXT: retl
;		;
; AVX2-64-LABEL: test_mul_v16i32_v16i8_minsize:		; AVX2-64-LABEL: test_mul_v16i32_v16i8_minsize:
; AVX2-64: # %bb.0:		; AVX2-64: # %bb.0:
; AVX2-64-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,2,3]		; AVX2-64-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,2,3]
; AVX2-64-NEXT: vpmovzxbd {{.*#+}} ymm1 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero,xmm1[4],zero,zero,zero,xmm1[5],zero,zero,zero,xmm1[6],zero,zero,zero,xmm1[7],zero,zero,zero		; AVX2-64-NEXT: vpmovzxbd {{.*#+}} ymm1 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero,xmm1[4],zero,zero,zero,xmm1[5],zero,zero,zero,xmm1[6],zero,zero,zero,xmm1[7],zero,zero,zero
; AVX2-64-NEXT: vpmovzxbd {{.*#+}} ymm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero,xmm0[4],zero,zero,zero,xmm0[5],zero,zero,zero,xmm0[6],zero,zero,zero,xmm0[7],zero,zero,zero		; AVX2-64-NEXT: vpmovzxbd {{.*#+}} ymm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero,xmm0[4],zero,zero,zero,xmm0[5],zero,zero,zero,xmm0[6],zero,zero,zero,xmm0[7],zero,zero,zero
; AVX2-64-NEXT: vpbroadcastd {{.*#+}} ymm2 = [18778,18778,18778,18778,18778,18778,18778,18778]		; AVX2-64-NEXT: vpbroadcastw {{.*#+}} ymm2 = [18778,18778,18778,18778,18778,18778,18778,18778,18778,18778,18778,18778,18778,18778,18778,18778]
; AVX2-64-NEXT: vpmaddwd %ymm2, %ymm0, %ymm0		; AVX2-64-NEXT: vpmaddwd %ymm2, %ymm0, %ymm0
; AVX2-64-NEXT: vpmaddwd %ymm2, %ymm1, %ymm1		; AVX2-64-NEXT: vpmaddwd %ymm2, %ymm1, %ymm1
; AVX2-64-NEXT: retq		; AVX2-64-NEXT: retq
;		;
; AVX512DQ-32-LABEL: test_mul_v16i32_v16i8_minsize:		; AVX512DQ-32-LABEL: test_mul_v16i32_v16i8_minsize:
; AVX512DQ-32: # %bb.0:		; AVX512DQ-32: # %bb.0:
; AVX512DQ-32-NEXT: vpmovzxbd {{.*#+}} zmm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero,xmm0[4],zero,zero,zero,xmm0[5],zero,zero,zero,xmm0[6],zero,zero,zero,xmm0[7],zero,zero,zero,xmm0[8],zero,zero,zero,xmm0[9],zero,zero,zero,xmm0[10],zero,zero,zero,xmm0[11],zero,zero,zero,xmm0[12],zero,zero,zero,xmm0[13],zero,zero,zero,xmm0[14],zero,zero,zero,xmm0[15],zero,zero,zero		; AVX512DQ-32-NEXT: vpmovzxbd {{.*#+}} zmm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero,xmm0[4],zero,zero,zero,xmm0[5],zero,zero,zero,xmm0[6],zero,zero,zero,xmm0[7],zero,zero,zero,xmm0[8],zero,zero,zero,xmm0[9],zero,zero,zero,xmm0[10],zero,zero,zero,xmm0[11],zero,zero,zero,xmm0[12],zero,zero,zero,xmm0[13],zero,zero,zero,xmm0[14],zero,zero,zero,xmm0[15],zero,zero,zero
; AVX512DQ-32-NEXT: vpmulld {{\.?LCPI[0-9]+_[0-9]+}}{1to16}, %zmm0, %zmm0		; AVX512DQ-32-NEXT: vpmulld {{\.?LCPI[0-9]+_[0-9]+}}{1to16}, %zmm0, %zmm0
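The AVX2 checks switch from vpbroadcastd to vpbroadcastw, and this appears to have the same cause. Once the odd i16 lanes of the constant are undef, <18778,u,18778,u,...> can be treated as a 16-bit splat. By contrast, [18778,0,18778,0,...] only splats at 32 bits. The sketch below shows splat detection that ignores undef lanes, with undef modeled as std::nullopt. `findSplatIgnoringUndef` is hypothetical. LLVM's own logic (for example `BuildVectorSDNode::isConstantSplat`) is more general and also searches for the smallest splat width.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical helper (illustration only): each lane is either a 16-bit
// constant or std::nullopt for undef. Returns the common value if every
// defined lane holds the same value, otherwise std::nullopt.
// An all-undef vector also yields std::nullopt.
std::optional<uint16_t> findSplatIgnoringUndef(
    const std::vector<std::optional<uint16_t>>& lanes) {
  std::optional<uint16_t> splat;
  for (const auto& lane : lanes) {
    if (!lane)
      continue;  // an undef lane is compatible with any splat value
    if (splat && *splat != *lane)
      return std::nullopt;  // two different defined values: not a splat
    splat = lane;
  }
  return splat;
}
```

With this rule, the old i16 constant (18778 and 0 alternating) is not a 16-bit splat, while the new one with undef odd lanes splats to 18778. A single word broadcast is then enough to materialize it.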