This is an archive of the discontinued LLVM Phabricator instance.

[x86] enhance matching of pmaddwd
ClosedPublic

Authored by spatel on Mar 29 2021, 12:12 PM.

Details

Reviewers

craig.topper
RKSimon
pengfei

Commits

rGe694e19a7931: [x86] enhance matching of pmaddwd

Summary

This was crashing with the example from:
https://llvm.org/PR49716
...and I hopefully stubbed that out with a283d7258360, but as we can see from the SSE vs. AVX code, I think we need to try harder to match the pattern.

This matcher code was adapted from another pmadd pattern match in D49636, but it needs different ops to deal with size mismatches.
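
For readers unfamiliar with the instruction, here is a minimal scalar sketch of what a 128-bit pmaddwd computes, and of why the even/odd shuffle + sext + mul + add pattern in the tests is equivalent to pmaddwd on the low half of a wider input. The names pmaddwdModel, patternOnWideInputs, lowHalf and checkLowHalfEquivalence are hypothetical helpers for illustration and are not part of LLVM; the narrowing cast relies on two's-complement wraparound, which is guaranteed only from C++20 on but is what mainstream compilers do.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Lanes8 = std::array<std::int16_t, 8>;
using Lanes16 = std::array<std::int16_t, 16>;
using Result4 = std::array<std::int32_t, 4>;

// Scalar reference model of a 128-bit pmaddwd: multiply adjacent signed
// 16-bit lanes and add each pair of products into one 32-bit lane.
Result4 pmaddwdModel(const Lanes8 &a, const Lanes8 &b) {
  Result4 r{};
  for (std::size_t i = 0; i < r.size(); ++i) {
    // Sum in 64 bits, then wrap to 32 bits to avoid signed overflow in C++.
    const std::int64_t sum = std::int64_t(a[2 * i]) * b[2 * i] +
                             std::int64_t(a[2 * i + 1]) * b[2 * i + 1];
    r[i] = static_cast<std::int32_t>(static_cast<std::uint32_t>(sum));
  }
  return r;
}

// The pattern from the tests on 16-lane inputs: even lanes {0,2,4,6} and odd
// lanes {1,3,5,7} are sign-extended, multiplied, and added. Lanes 8..15 are
// never read.
Result4 patternOnWideInputs(const Lanes16 &x, const Lanes16 &y) {
  Result4 r{};
  for (std::size_t i = 0; i < r.size(); ++i) {
    const std::int64_t m0 = std::int64_t(x[2 * i]) * y[2 * i];
    const std::int64_t m1 = std::int64_t(x[2 * i + 1]) * y[2 * i + 1];
    r[i] = static_cast<std::int32_t>(static_cast<std::uint32_t>(m0 + m1));
  }
  return r;
}

// Keep only the low eight lanes of a 16-lane vector.
Lanes8 lowHalf(const Lanes16 &v) {
  Lanes8 lo{};
  for (std::size_t i = 0; i < lo.size(); ++i)
    lo[i] = v[i];
  return lo;
}

// The wide-input pattern equals pmaddwd applied to the low halves.
void checkLowHalfEquivalence(const Lanes16 &x, const Lanes16 &y) {
  assert(patternOnWideInputs(x, y) == pmaddwdModel(lowHalf(x), lowHalf(y)));
}
```

Because the pattern never reads the upper lanes, narrowing each input to its low half before forming the node should not change the result, which is the property this patch relies on.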

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

spatel created this revision. Mar 29 2021, 12:12 PM

Herald added subscribers: hiraditya, mcrosier. Mar 29 2021, 12:12 PM

spatel requested review of this revision. Mar 29 2021, 12:12 PM

Herald added a project: Restricted Project. Mar 29 2021, 12:12 PM

This seems reasonable. Is it possible we also need to handle the case where all the extract indices come from a different subvector of the input? It's hard to say from the test case that we started from, since it appears to have a logic mistake that exposed the bug in the first place. It used "sizeof sizeof(r)" in its loop control, which caused the loop bounds to be 4. It seems like it should have been sizeof(r)/sizeof(int16_t).
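
As a side note on the source-level mistake mentioned above, here is a small sketch of why a doubled sizeof cannot produce a lane count. The array type and the helper names buggyBound, laneCount and checkBounds are assumptions for illustration; the original reproducer from the bug report is not shown on this page.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// `sizeof sizeof(r)` parses as sizeof(sizeof(r)): the size of a std::size_t
// value. It does not depend on the type or size of r at all.
template <typename T>
constexpr std::size_t buggyBound(const T &r) {
  return sizeof sizeof(r);
}

// The bound suggested in the review: total bytes divided by bytes per lane.
template <typename T, std::size_t N>
constexpr std::size_t laneCount(const T (&r)[N]) {
  return sizeof(r) / sizeof(T);
}

void checkBounds() {
  std::int16_t small[8] = {};
  std::int16_t large[16] = {};
  // The buggy bound is the same constant for both arrays.
  assert(buggyBound(small) == sizeof(std::size_t));
  assert(buggyBound(small) == buggyBound(large));
  // The intended bound tracks the number of lanes.
  assert(laneCount(small) == 8 && laneCount(large) == 16);
}
```

A loop bounded this way touches only a fixed prefix of the vector, which is one plausible way for valid-looking source to produce a narrow result from wide inputs.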

Harbormaster completed remote builds in B96174: Diff 333957. Mar 29 2021, 12:47 PM

In D99531#2656756, @craig.topper wrote:

This seems reasonable. Is it possible we also need to handle the case where all the extract indices come from a different subvector of the input? It's hard to say from the test case that we started from, since it appears to have a logic mistake that exposed the bug in the first place. It used "sizeof sizeof(r)" in its loop control, which caused the loop bounds to be 4. It seems like it should have been sizeof(r)/sizeof(int16_t).

Right - that source looks unintended.
We could go further in matching, but it gets harder to keep track of the indices and then calculate the offsets, and it doesn't seem as likely to come up in practice. This case seemed plausible from valid source even (!), so that's why I figured we should try to get it.

I'm inclined to agree. So this patch LGTM.

This revision is now accepted and ready to land. Mar 29 2021, 1:40 PM

Closed by commit rGe694e19a7931: [x86] enhance matching of pmaddwd (authored by spatel). Mar 30 2021, 4:28 AM

This revision was automatically updated to reflect the committed changes.

spatel added a commit: rGe694e19a7931: [x86] enhance matching of pmaddwd.

Revision Contents

Path                                       Size
llvm/lib/Target/X86/X86ISelLowering.cpp    23 lines
llvm/test/CodeGen/X86/madd.ll              112 lines

Diff 334107

llvm/lib/Target/X86/X86ISelLowering.cpp

@@ ... @@ for (unsigned i = 0; i != N00.getNumOperands(); ++i) {
     SDValue N10In = N10Elt.getOperand(0);
     SDValue N11In = N11Elt.getOperand(0);

     // First time we find an input capture it.
     if (!In0) {
       In0 = N00In;
       In1 = N01In;

-      // The input vector sizes must match the output.
-      // TODO: Insert cast ops to allow different types.
-      if (In0.getValueSizeInBits() != VT.getSizeInBits() ||
-          In1.getValueSizeInBits() != VT.getSizeInBits())
+      // The input vectors must be at least as wide as the output.
+      // If they are larger than the output, we extract subvector below.
+      if (In0.getValueSizeInBits() < VT.getSizeInBits() ||
+          In1.getValueSizeInBits() < VT.getSizeInBits())
         return SDValue();
     }
     // Mul is commutative so the input vectors can be in any order.
     // Canonicalize to make the compares easier.
     if (In0 != N00In)
       std::swap(N00In, N01In);
     if (In0 != N10In)
       std::swap(N10In, N11In);
     if (In0 != N00In || In1 != N01In || In0 != N10In || In1 != N11In)
       return SDValue();
   }

   auto PMADDBuilder = [](SelectionDAG &DAG, const SDLoc &DL,
                          ArrayRef<SDValue> Ops) {
-    // Shrink by adding truncate nodes and let DAGCombine fold with the
-    // sources.
     EVT OpVT = Ops[0].getValueType();
     assert(OpVT.getScalarType() == MVT::i16 &&
            "Unexpected scalar element type");
     assert(OpVT == Ops[1].getValueType() && "Operands' types mismatch");
     EVT ResVT = EVT::getVectorVT(*DAG.getContext(), MVT::i32,
                                  OpVT.getVectorNumElements() / 2);
     return DAG.getNode(X86ISD::VPMADDWD, DL, ResVT, Ops[0], Ops[1]);
   };

+  // If the output is narrower than an input, extract the low part of the input
+  // vector.
+  EVT OutVT16 = EVT::getVectorVT(*DAG.getContext(), MVT::i16,
+                                 VT.getVectorNumElements() * 2);
+  if (OutVT16.bitsLT(In0.getValueType())) {
+    In0 = DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, OutVT16, In0,
+                      DAG.getIntPtrConstant(0, DL));
+  }
+  if (OutVT16.bitsLT(In1.getValueType())) {
+    In1 = DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, OutVT16, In1,
+                      DAG.getIntPtrConstant(0, DL));
+  }
   return SplitOpsAndApply(DAG, Subtarget, DL, VT, { In0, In1 },
                           PMADDBuilder);
 }

 static SDValue combineAddOrSubToHADDorHSUB(SDNode *N, SelectionDAG &DAG,
                                            const X86Subtarget &Subtarget) {
   EVT VT = N->getValueType(0);
   SDValue Op0 = N->getOperand(0);
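
The size handling in this hunk can be sketched outside of SelectionDAG as follows. This is a simplified model, not the LLVM code: I16Vec stands in for an i16 vector value, fitInputToOutput is a hypothetical helper, and the lane arithmetic mirrors the OutVT16 computation (twice as many i16 lanes as i32 output lanes).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a vector of i16 lanes.
using I16Vec = std::vector<std::int16_t>;

// Model of the patched check: an input narrower than the output is rejected;
// an input at least as wide is narrowed to its low lanes (subvector index 0).
// Returns false when the pattern would not be matched.
bool fitInputToOutput(const I16Vec &in, std::size_t outI32Lanes, I16Vec &out) {
  assert(&in != &out && "input and output must be distinct vectors");
  const std::size_t inBits = in.size() * 16;
  const std::size_t outBits = outI32Lanes * 32;
  if (inBits < outBits)
    return false; // Input too narrow to supply every output lane.

  // Same lane count as OutVT16 in the patch: two i16 lanes per i32 result.
  const std::size_t outI16Lanes = outI32Lanes * 2;
  out.assign(in.begin(),
             in.begin() + static_cast<std::ptrdiff_t>(outI16Lanes));
  return true;
}
```

With a 16-lane input and a 4-lane i32 output, the model keeps lanes 0..7, which corresponds to the single xmm-sized pmaddwd in the updated test output.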

llvm/test/CodeGen/X86/madd.ll

@@ ... @@ middle.block:
   ret i32 %11
 }

 ; PR49716 - https://llvm.org/PR49716

 define <4 x i32> @input_size_mismatch(<16 x i16> %x, <16 x i16>* %p) {
 ; SSE2-LABEL: input_size_mismatch:
 ; SSE2: # %bb.0:
-; SSE2-NEXT: movdqa (%rdi), %xmm1
-; SSE2-NEXT: pshuflw {{.*#+}} xmm2 = xmm0[0,2,2,3,4,5,6,7]
-; SSE2-NEXT: pshufhw {{.*#+}} xmm2 = xmm2[0,1,2,3,4,6,6,7]
-; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm2[0,2,2,3]
-; SSE2-NEXT: pshuflw {{.*#+}} xmm0 = xmm0[3,1,2,3,4,5,6,7]
-; SSE2-NEXT: pshufhw {{.*#+}} xmm0 = xmm0[0,1,2,3,7,5,6,7]
-; SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
-; SSE2-NEXT: pshuflw {{.*#+}} xmm0 = xmm0[1,0,3,2,4,5,6,7]
-; SSE2-NEXT: pshuflw {{.*#+}} xmm3 = xmm1[0,2,2,3,4,5,6,7]
-; SSE2-NEXT: pshufhw {{.*#+}} xmm3 = xmm3[0,1,2,3,4,6,6,7]
-; SSE2-NEXT: pshufd {{.*#+}} xmm3 = xmm3[0,2,2,3]
-; SSE2-NEXT: pshuflw {{.*#+}} xmm1 = xmm1[3,1,2,3,4,5,6,7]
-; SSE2-NEXT: pshufhw {{.*#+}} xmm1 = xmm1[0,1,2,3,7,5,6,7]
-; SSE2-NEXT: pshufd {{.*#+}} xmm1 = xmm1[0,2,2,3]
-; SSE2-NEXT: pshuflw {{.*#+}} xmm1 = xmm1[1,0,3,2,4,5,6,7]
-; SSE2-NEXT: movdqa %xmm2, %xmm4
-; SSE2-NEXT: pmulhw %xmm3, %xmm4
-; SSE2-NEXT: pmullw %xmm3, %xmm2
-; SSE2-NEXT: punpcklwd {{.*#+}} xmm2 = xmm2[0],xmm4[0],xmm2[1],xmm4[1],xmm2[2],xmm4[2],xmm2[3],xmm4[3]
-; SSE2-NEXT: movdqa %xmm0, %xmm3
-; SSE2-NEXT: pmulhw %xmm1, %xmm3
-; SSE2-NEXT: pmullw %xmm1, %xmm0
-; SSE2-NEXT: punpcklwd {{.*#+}} xmm0 = xmm0[0],xmm3[0],xmm0[1],xmm3[1],xmm0[2],xmm3[2],xmm0[3],xmm3[3]
-; SSE2-NEXT: paddd %xmm2, %xmm0
+; SSE2-NEXT: pmaddwd (%rdi), %xmm0
 ; SSE2-NEXT: retq
 ;
 ; AVX-LABEL: input_size_mismatch:
 ; AVX: # %bb.0:
-; AVX-NEXT: vmovdqa {{.*#+}} xmm1 = [0,1,4,5,8,9,12,13,8,9,12,13,12,13,14,15]
-; AVX-NEXT: vpshufb %xmm1, %xmm0, %xmm2
-; AVX-NEXT: vmovdqa {{.*#+}} xmm3 = [2,3,6,7,10,11,14,15,14,15,10,11,12,13,14,15]
-; AVX-NEXT: vpshufb %xmm3, %xmm0, %xmm0
-; AVX-NEXT: vmovdqa (%rdi), %xmm4
-; AVX-NEXT: vpshufb %xmm1, %xmm4, %xmm1
-; AVX-NEXT: vpshufb %xmm3, %xmm4, %xmm3
-; AVX-NEXT: vpmovsxwd %xmm2, %xmm2
-; AVX-NEXT: vpmovsxwd %xmm0, %xmm0
-; AVX-NEXT: vpmovsxwd %xmm1, %xmm1
-; AVX-NEXT: vpmulld %xmm1, %xmm2, %xmm1
-; AVX-NEXT: vpmovsxwd %xmm3, %xmm2
-; AVX-NEXT: vpmulld %xmm2, %xmm0, %xmm0
-; AVX-NEXT: vpaddd %xmm0, %xmm1, %xmm0
+; AVX-NEXT: vpmaddwd (%rdi), %xmm0, %xmm0
 ; AVX-NEXT: vzeroupper
 ; AVX-NEXT: retq
   %y = load <16 x i16>, <16 x i16>* %p, align 32
   %x0 = shufflevector <16 x i16> %x, <16 x i16> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
   %x1 = shufflevector <16 x i16> %x, <16 x i16> undef, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
   %y0 = shufflevector <16 x i16> %y, <16 x i16> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
   %y1 = shufflevector <16 x i16> %y, <16 x i16> undef, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
   %sx0 = sext <4 x i16> %x0 to <4 x i32>
 [9 unchanged lines not shown]
 define <4 x i32> @output_size_mismatch(<16 x i16> %x, <16 x i16> %y) {
 ; SSE2-LABEL: output_size_mismatch:
 ; SSE2: # %bb.0:
 ; SSE2-NEXT: pmaddwd %xmm2, %xmm0
 ; SSE2-NEXT: retq
 ;
 ; AVX-LABEL: output_size_mismatch:
 ; AVX: # %bb.0:
-; AVX-NEXT: vmovdqa {{.*#+}} xmm2 = [0,1,4,5,8,9,12,13,8,9,12,13,12,13,14,15]
-; AVX-NEXT: vpshufb %xmm2, %xmm0, %xmm3
-; AVX-NEXT: vmovdqa {{.*#+}} xmm4 = [2,3,6,7,10,11,14,15,14,15,10,11,12,13,14,15]
-; AVX-NEXT: vpshufb %xmm4, %xmm0, %xmm0
-; AVX-NEXT: vpshufb %xmm2, %xmm1, %xmm2
-; AVX-NEXT: vpshufb %xmm4, %xmm1, %xmm1
-; AVX-NEXT: vpmovsxwd %xmm3, %xmm3
-; AVX-NEXT: vpmovsxwd %xmm0, %xmm0
-; AVX-NEXT: vpmovsxwd %xmm2, %xmm2
-; AVX-NEXT: vpmulld %xmm2, %xmm3, %xmm2
-; AVX-NEXT: vpmovsxwd %xmm1, %xmm1
-; AVX-NEXT: vpmulld %xmm1, %xmm0, %xmm0
-; AVX-NEXT: vpaddd %xmm0, %xmm2, %xmm0
+; AVX-NEXT: vpmaddwd %xmm1, %xmm0, %xmm0
 ; AVX-NEXT: vzeroupper
 ; AVX-NEXT: retq
   %x0 = shufflevector <16 x i16> %x, <16 x i16> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
   %x1 = shufflevector <16 x i16> %x, <16 x i16> undef, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
   %y0 = shufflevector <16 x i16> %y, <16 x i16> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
   %y1 = shufflevector <16 x i16> %y, <16 x i16> undef, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
   %sx0 = sext <4 x i16> %x0 to <4 x i32>
   %sx1 = sext <4 x i16> %x1 to <4 x i32>
   %sy0 = sext <4 x i16> %y0 to <4 x i32>
   %sy1 = sext <4 x i16> %y1 to <4 x i32>
   %m0 = mul <4 x i32> %sx0, %sy0
   %m1 = mul <4 x i32> %sx1, %sy1
   %r = add <4 x i32> %m0, %m1
   ret <4 x i32> %r
 }

+define <4 x i32> @output_size_mismatch_high_subvector(<16 x i16> %x, <16 x i16> %y) {
+; SSE2-LABEL: output_size_mismatch_high_subvector:
+; SSE2: # %bb.0:
+; SSE2-NEXT: movdqa %xmm1, %xmm0
+; SSE2-NEXT: pmaddwd %xmm2, %xmm0
+; SSE2-NEXT: retq
+;
+; AVX1-LABEL: output_size_mismatch_high_subvector:
+; AVX1: # %bb.0:
+; AVX1-NEXT: vmovdqa {{.*#+}} xmm2 = [0,1,4,5,8,9,12,13,8,9,12,13,12,13,14,15]
+; AVX1-NEXT: vextractf128 $1, %ymm0, %xmm0
+; AVX1-NEXT: vpshufb %xmm2, %xmm0, %xmm3
+; AVX1-NEXT: vmovdqa {{.*#+}} xmm4 = [2,3,6,7,10,11,14,15,14,15,10,11,12,13,14,15]
+; AVX1-NEXT: vpshufb %xmm4, %xmm0, %xmm0
+; AVX1-NEXT: vpshufb %xmm2, %xmm1, %xmm2
+; AVX1-NEXT: vpshufb %xmm4, %xmm1, %xmm1
+; AVX1-NEXT: vpmovsxwd %xmm3, %xmm3
+; AVX1-NEXT: vpmovsxwd %xmm0, %xmm0
+; AVX1-NEXT: vpmovsxwd %xmm2, %xmm2
+; AVX1-NEXT: vpmulld %xmm2, %xmm3, %xmm2
+; AVX1-NEXT: vpmovsxwd %xmm1, %xmm1
+; AVX1-NEXT: vpmulld %xmm1, %xmm0, %xmm0
+; AVX1-NEXT: vpaddd %xmm0, %xmm2, %xmm0
+; AVX1-NEXT: vzeroupper
+; AVX1-NEXT: retq
+;
+; AVX256-LABEL: output_size_mismatch_high_subvector:
+; AVX256: # %bb.0:
+; AVX256-NEXT: vmovdqa {{.*#+}} xmm2 = [0,1,4,5,8,9,12,13,8,9,12,13,12,13,14,15]
+; AVX256-NEXT: vextracti128 $1, %ymm0, %xmm0
+; AVX256-NEXT: vpshufb %xmm2, %xmm0, %xmm3
+; AVX256-NEXT: vmovdqa {{.*#+}} xmm4 = [2,3,6,7,10,11,14,15,14,15,10,11,12,13,14,15]
+; AVX256-NEXT: vpshufb %xmm4, %xmm0, %xmm0
+; AVX256-NEXT: vpshufb %xmm2, %xmm1, %xmm2
+; AVX256-NEXT: vpshufb %xmm4, %xmm1, %xmm1
+; AVX256-NEXT: vpmovsxwd %xmm3, %xmm3
+; AVX256-NEXT: vpmovsxwd %xmm0, %xmm0
+; AVX256-NEXT: vpmovsxwd %xmm2, %xmm2
+; AVX256-NEXT: vpmulld %xmm2, %xmm3, %xmm2
+; AVX256-NEXT: vpmovsxwd %xmm1, %xmm1
+; AVX256-NEXT: vpmulld %xmm1, %xmm0, %xmm0
+; AVX256-NEXT: vpaddd %xmm0, %xmm2, %xmm0
+; AVX256-NEXT: vzeroupper
+; AVX256-NEXT: retq
+  %x0 = shufflevector <16 x i16> %x, <16 x i16> undef, <4 x i32> <i32 8, i32 10, i32 12, i32 14>
+  %x1 = shufflevector <16 x i16> %x, <16 x i16> undef, <4 x i32> <i32 9, i32 11, i32 13, i32 15>
+  %y0 = shufflevector <16 x i16> %y, <16 x i16> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
+  %y1 = shufflevector <16 x i16> %y, <16 x i16> undef, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
+  %sx0 = sext <4 x i16> %x0 to <4 x i32>
+  %sx1 = sext <4 x i16> %x1 to <4 x i32>
+  %sy0 = sext <4 x i16> %y0 to <4 x i32>
+  %sy1 = sext <4 x i16> %y1 to <4 x i32>
+  %m0 = mul <4 x i32> %sx0, %sy0
+  %m1 = mul <4 x i32> %sx1, %sy1
+  %r = add <4 x i32> %m0, %m1
+  ret <4 x i32> %r
+}
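
The difference between the low-subvector tests and the high-subvector test can be sketched as an index check. This is an illustrative model that assumes the matcher requires result lane i to read source lanes 2*i and 2*i+1 of each input; that requirement is not shown in the hunks on this page, and usesLowSubvector and checkTestMasks are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns true when the even/odd extract indices start at lane 0, that is
// lane i of the result reads source lanes 2*i and 2*i+1. Under the stated
// assumption, only this shape is matched after the patch.
bool usesLowSubvector(const std::vector<int> &evenMask,
                      const std::vector<int> &oddMask) {
  if (evenMask.size() != oddMask.size())
    return false;
  for (std::size_t i = 0; i < evenMask.size(); ++i) {
    const int expectedEven = static_cast<int>(2 * i);
    if (evenMask[i] != expectedEven || oddMask[i] != expectedEven + 1)
      return false;
  }
  return true;
}

// Shuffle masks copied from the tests above.
void checkTestMasks() {
  // %x0/%x1 in @output_size_mismatch: low half of the 16-lane input.
  assert(usesLowSubvector({0, 2, 4, 6}, {1, 3, 5, 7}));
  // %x0/%x1 in @output_size_mismatch_high_subvector: high half.
  assert(!usesLowSubvector({8, 10, 12, 14}, {9, 11, 13, 15}));
}
```

This is consistent with the AVX output of the high-subvector test, which still uses the shuffle and multiply sequence, and with the review discussion that matching other subvectors was left out as less likely to come up in practice.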