This is an archive of the discontinued LLVM Phabricator instance.

Use PMADDWD to expand reduction in a loop
Closed, Public

Authored by danielcdh on Apr 4 2017, 2:05 PM.

Details

Reviewers

RKSimon
zvi
wmi
davidxl
mkuper
hfinkel

Commits

rG58fa724494b6: Use PMADDWD to expand reduction in a loop
rL299776: Use PMADDWD to expand reduction in a loop

Summary

PMADDWD can improve the performance of 8/16-bit integer multiply-add operations in cases like:

for (int i = 0; i < count; i++)
  a += x[i] * y[i];
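As a rough illustration of why PMADDWD fits this loop, here is a scalar C++ model of a single PMADDWD output lane, together with the reduction written both ways. This is only a sketch: the function names (`pmaddwd_lane`, `dot_scalar`, `dot_pairwise`) are made up for this example and are not part of the patch, and the inputs are assumed to be 16-bit values with an even element count.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar model of one 32-bit PMADDWD output lane: two adjacent pairs of signed
// 16-bit inputs are multiplied and the two 32-bit products are added.
// The sum is formed in unsigned arithmetic so that the one overflowing input
// (all four values equal to -32768) wraps instead of being undefined.
inline std::int32_t pmaddwd_lane(std::int16_t a0, std::int16_t b0,
                                 std::int16_t a1, std::int16_t b1) {
  const std::uint32_t p0 =
      static_cast<std::uint32_t>(std::int32_t{a0} * std::int32_t{b0});
  const std::uint32_t p1 =
      static_cast<std::uint32_t>(std::int32_t{a1} * std::int32_t{b1});
  return static_cast<std::int32_t>(p0 + p1);
}

// The loop from the summary: one product per iteration, 32-bit accumulator.
inline std::int32_t dot_scalar(const std::vector<std::int16_t> &x,
                               const std::vector<std::int16_t> &y) {
  std::int32_t a = 0;
  for (std::size_t i = 0; i < x.size(); ++i)
    a += std::int32_t{x[i]} * std::int32_t{y[i]};
  return a;
}

// The same reduction, consuming two elements per step as PMADDWD does.
inline std::int32_t dot_pairwise(const std::vector<std::int16_t> &x,
                                 const std::vector<std::int16_t> &y) {
  assert(x.size() == y.size() && x.size() % 2 == 0);
  std::int32_t a = 0;
  for (std::size_t i = 0; i < x.size(); i += 2)
    a += pmaddwd_lane(x[i], y[i], x[i + 1], y[i + 1]);
  return a;
}
```

Both functions return the same sum as long as the 32-bit accumulator itself does not overflow; the pairwise form simply regroups the additions.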

Diff Detail

Build Status

Buildable 5375
Build 5375: arc lint + arc unit

Event Timeline

danielcdh created this revision. Apr 4 2017, 2:05 PM

Harbormaster completed remote builds in B5300: Diff 94119. Apr 4 2017, 2:05 PM

craig.topper added reviewers: RKSimon, zvi. Apr 4 2017, 2:10 PM

wmi added inline comments. Apr 4 2017, 3:46 PM

lib/Target/X86/X86ISelLowering.cpp
34588–34599	Maybe use std::swap, so that Op0 and Op1 are unnecessary:
	MulOp = N->getOperand(0);
	Phi = N->getOperand(1);
	if (MulOp.getOpcode() != ISD::MUL) {
	  std::swap(MulOp, Phi);
	  if (MulOp.getOpcode() != ISD::MUL)
	    return SDValue();
	}
34605–34607	I just realized that we may need to prove no wrap before using PMADDUBSW. It is possible that i8*i8 + i8*i8 overflows i16, in which case PMADDUBSW generates a saturated result, whereas the original i32 operations have no overflow issue.

Remove the support for PMADDUBSW, as it cannot handle the overflow case.

lib/Target/X86/X86ISelLowering.cpp
34605–34607	Good catch! It looks like the compiler should not generate PMADDUBSW directly. Patch updated.
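The overflow concern raised above can be made concrete with a scalar C++ model of one PMADDUBSW output lane. This is a sketch for illustration only: `pmaddubsw_lane` and `exact_pair_sum` are hypothetical helper names, and the model assumes the documented behaviour of the instruction (unsigned bytes from the first operand, signed bytes from the second, adjacent products added with signed saturation to 16 bits).

```cpp
#include <algorithm>
#include <cstdint>

// Scalar model of one 16-bit PMADDUBSW output lane: the sum of two
// unsigned-by-signed byte products, saturated to the signed 16-bit range.
inline std::int16_t pmaddubsw_lane(std::uint8_t a0, std::int8_t b0,
                                   std::uint8_t a1, std::int8_t b1) {
  const int sum = int{a0} * int{b0} + int{a1} * int{b1};
  return static_cast<std::int16_t>(std::min(32767, std::max(-32768, sum)));
}

// What the original i32 arithmetic in the loop would compute for the same
// pair of products: no saturation, no overflow.
inline std::int32_t exact_pair_sum(std::uint8_t a0, std::int8_t b0,
                                   std::uint8_t a1, std::int8_t b1) {
  return std::int32_t{a0} * std::int32_t{b0} +
         std::int32_t{a1} * std::int32_t{b1};
}
```

For the inputs (255, 127, 255, 127) the exact sum is 64770, while the saturating lane yields 32767, so the two computations disagree unless the inputs are proven to stay within range.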

danielcdh retitled this revision from Support PMADDWD and PMADDUBSW to Use PMADDWD to expand reduction in a loop. Apr 4 2017, 5:02 PM

danielcdh edited the summary of this revision.

wmi added inline comments. Apr 5 2017, 10:06 AM

test/CodeGen/X86/madd.ll
25	The test can be made simpler by removing the preheader and exit, just like test/CodeGen/X86/sad.ll does. That will make it easier to extend the test later to check different element types or vector lengths.

Thanks for working on this patch. Regarding support for PMADDUBSW, can we match something like the following?

for (int i = 0; i < count; i++) {
  a = saturate(a + x[i] * y[i]);
}

In D31679#719786, @zvi wrote:
Thanks for working on this patch. Regarding support for PMADDUBSW, can we match something like the following?
for (int i = 0; i < count; i++) {
  a = saturate(a + x[i] * y[i]);
}

How does the user specify "saturate"? Is it a general builtin in clang?

simplify test

Harbormaster completed remote builds in B5360: Diff 94376. Apr 6 2017, 8:31 AM

I suggest we leave the PMADDUBSW discussion for a separate patch.

Some minor comments inline.

lib/Target/X86/X86ISelLowering.cpp
34600	Since this version always uses PMADDWD, we no longer care about Mode at all, and this entire if can go away.
34630	Isn't Madd.getSimpleValueType() always MVT::i32?
34630	Just a sanity check - this is correct irrespective of whether the initial value for the reduction phi is 0 or not, right?
34708	Bikeshedding: maybe call this combineLoopMAddPattern() to match the other name?
34708	Sad -> Madd?

update

lib/Target/X86/X86ISelLowering.cpp
34630	I suppose yes.
34630	It should be vector type instead of scalar type.
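The sanity check about the reduction phi can be illustrated with a small scalar model of one loop iteration for the 8-lane case. This is a sketch, not the patch's code: `step_original`, `step_madd` and `horizontal_sum` are hypothetical names, and it assumes the accumulator is only ever consumed through a final horizontal add, which is what the vector-reduction flag is understood to guarantee here.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <numeric>

using Vec8 = std::array<std::int32_t, 8>;
using In8 = std::array<std::int16_t, 8>;

// Original form: each of the 8 accumulator lanes receives one product.
inline Vec8 step_original(Vec8 phi, const In8 &x, const In8 &y) {
  for (std::size_t i = 0; i < 8; ++i)
    phi[i] += std::int32_t{x[i]} * std::int32_t{y[i]};
  return phi;
}

// Rewritten form: 4 lanes of pairwise multiply-add, concatenated with
// 4 zero lanes, then added to the accumulator.
inline Vec8 step_madd(Vec8 phi, const In8 &x, const In8 &y) {
  for (std::size_t i = 0; i < 4; ++i)
    phi[i] += std::int32_t{x[2 * i]} * std::int32_t{y[2 * i]} +
              std::int32_t{x[2 * i + 1]} * std::int32_t{y[2 * i + 1]};
  // Lanes 4..7 receive the zero half of the concatenation and are unchanged.
  return phi;
}

// The final reduction performed after the loop.
inline std::int32_t horizontal_sum(const Vec8 &v) {
  return std::accumulate(v.begin(), v.end(), std::int32_t{0});
}
```

The individual lanes differ between the two forms, but the horizontal sum is the same for any starting accumulator, zero or not, because the rewrite only moves products between lanes.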

Harbormaster completed remote builds in B5375: Diff 94474. Apr 6 2017, 7:25 PM

LGTM

lib/Target/X86/X86ISelLowering.cpp
34630	Argh, right, sorry.

This revision is now accepted and ready to land. Apr 6 2017, 10:00 PM

danielcdh closed this revision. Apr 7 2017, 8:54 AM

In D31679#720193, @danielcdh wrote:

How does the user specify "saturate"? Is it a general builtin in clang?

I'm not aware of such a builtin; my snippet above was more like pseudo-code.
In C++ you may implement saturation as:

#include <algorithm> // std::min, std::max

int sat_sint16(int x) {
  return std::min(32767, std::max(-32768, x));
}

AFAIK, the loop vectorizer will not vectorize the reduction for PMADDUBSW, so I agree with @mkuper that this should be done in a different patch.

Revision Contents

Path	Size
lib/Target/X86/X86ISelLowering.cpp	47 lines
test/CodeGen/X86/madd.ll	103 lines

Diff 94474

lib/Target/X86/X86ISelLowering.cpp

[... 32,759 lines not shown ...]
                      DAG.getConstant(-1ULL, DL, VT), NewCmp);

   // X - (Z == 0) --> sub X, (zext(sete Z, 0)) --> sbb X, 0, (cmp Z, 1)
   // X + (Z == 0) --> add X, (zext(sete Z, 0)) --> adc X, 0, (cmp Z, 1)
   return DAG.getNode(IsSub ? X86ISD::SBB : X86ISD::ADC, DL, VT, X,
                      DAG.getConstant(0, DL, VT), NewCmp);
 }

+static SDValue combineLoopMAddPattern(SDNode *N, SelectionDAG &DAG,
+                                      const X86Subtarget &Subtarget) {
+  SDValue MulOp = N->getOperand(0);
+  SDValue Phi = N->getOperand(1);
+
+  if (MulOp.getOpcode() != ISD::MUL)
+    std::swap(MulOp, Phi);
+  if (MulOp.getOpcode() != ISD::MUL)
+    return SDValue();
+
+  ShrinkMode Mode;
+  if (!canReduceVMulWidth(MulOp.getNode(), DAG, Mode))
+    return SDValue();
+
Inline comment (wmi, done): Maybe use std::swap, so that Op0 and Op1 are unnecessary: MulOp = N->getOperand(0); Phi = N->getOperand(1); if (MulOp.getOpcode() != ISD::MUL) { std::swap(MulOp, Phi); if (MulOp.getOpcode() != ISD::MUL) return SDValue(); }
+  EVT VT = N->getValueType(0);
Inline comment (mkuper, done): Since this version always uses PMADDWD, we no longer care about Mode at all, and this entire if can go away.
+
+  unsigned RegSize = 128;
+  if (Subtarget.hasBWI())
+    RegSize = 512;
+  else if (Subtarget.hasAVX2())
+    RegSize = 256;
+  unsigned VectorSize = VT.getVectorNumElements() * 16;
Inline comment (wmi, not done): I just realized that we may need to prove no wrap before using PMADDUBSW. It is possible that i8*i8 + i8*i8 overflows i16, in which case PMADDUBSW generates a saturated result, whereas the original i32 operations have no overflow issue.
Inline comment (danielcdh, author, not done): Good catch! It looks like the compiler should not generate PMADDUBSW directly. Patch updated.
+  // If the vector size is less than 128, or greater than the supported RegSize,
+  // do not use PMADD.
+  if (VectorSize < 128 || VectorSize > RegSize)
+    return SDValue();
+
+  SDLoc DL(N);
+  EVT ReducedVT = EVT::getVectorVT(*DAG.getContext(), MVT::i16,
+                                   VT.getVectorNumElements());
+  EVT MAddVT = EVT::getVectorVT(*DAG.getContext(), MVT::i32,
+                                VT.getVectorNumElements() / 2);
+
+  // Shrink the operands of mul.
+  SDValue N0 = DAG.getNode(ISD::TRUNCATE, DL, ReducedVT, MulOp->getOperand(0));
+  SDValue N1 = DAG.getNode(ISD::TRUNCATE, DL, ReducedVT, MulOp->getOperand(1));
+
+  // Madd vector size is half of the original vector size
+  SDValue Madd = DAG.getNode(X86ISD::VPMADDWD, DL, MAddVT, N0, N1);
+  // Fill the rest of the output with 0
+  SDValue Zero = getZeroVector(Madd.getSimpleValueType(), Subtarget, DAG, DL);
+  SDValue Concat = DAG.getNode(ISD::CONCAT_VECTORS, DL, VT, Madd, Zero);
+  return DAG.getNode(ISD::ADD, DL, VT, Concat, Phi);
+}
+
Inline comment (mkuper, not done): Isn't Madd.getSimpleValueType() always MVT::i32?
Inline comment (danielcdh, author, not done): It should be vector type instead of scalar type.
Inline comment (mkuper, not done): Argh, right, sorry.
Inline comment (mkuper, not done): Just a sanity check - this is correct irrespective of whether the initial value for the reduction phi is 0 or not, right?
Inline comment (danielcdh, author, not done): I suppose yes.
 static SDValue combineLoopSADPattern(SDNode *N, SelectionDAG &DAG,
                                      const X86Subtarget &Subtarget) {
   SDLoc DL(N);
   EVT VT = N->getValueType(0);
   SDValue Op0 = N->getOperand(0);
   SDValue Op1 = N->getOperand(1);

   // TODO: There's nothing special about i32, any integer type above i16 should
[... 61 lines not shown ...]
 }

 static SDValue combineAdd(SDNode *N, SelectionDAG &DAG,
                           const X86Subtarget &Subtarget) {
   const SDNodeFlags *Flags = &cast<BinaryWithFlagsSDNode>(N)->Flags;
   if (Flags->hasVectorReduction()) {
     if (SDValue Sad = combineLoopSADPattern(N, DAG, Subtarget))
       return Sad;
+    if (SDValue MAdd = combineLoopMAddPattern(N, DAG, Subtarget))
Inline comment (mkuper, done): Bikeshedding: maybe call this combineLoopMAddPattern() to match the other name?
Inline comment (mkuper, done): Sad -> Madd?
+      return MAdd;
   }
   EVT VT = N->getValueType(0);
   SDValue Op0 = N->getOperand(0);
   SDValue Op1 = N->getOperand(1);

   // Try to synthesize horizontal adds from adds of shuffles.
   if (((Subtarget.hasSSSE3() && (VT == MVT::v8i16 || VT == MVT::v4i32)) ||
        (Subtarget.hasInt256() && (VT == MVT::v16i16 || VT == MVT::v8i32))) &&
[... 1,324 lines not shown ...]
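The width check in combineLoopMAddPattern can be modelled in isolation to see which reduction widths qualify on which subtargets. The following is a minimal sketch: `Features` and `fitsPmaddwd` are hypothetical stand-ins for the `X86Subtarget` queries and are not LLVM APIs; only the arithmetic mirrors the diff above.

```cpp
// Hypothetical stand-in for Subtarget.hasAVX2() / Subtarget.hasBWI().
struct Features {
  bool avx2;
  bool bwi;
};

// Mirrors the check in the diff: a result with numElements i32 lanes has its
// multiply operands truncated to numElements i16 lanes, i.e. numElements * 16
// bits, which must be at least 128 and at most the widest supported register.
inline bool fitsPmaddwd(unsigned numElements, Features f) {
  unsigned regSize = 128;
  if (f.bwi)
    regSize = 512;
  else if (f.avx2)
    regSize = 256;
  const unsigned vectorSize = numElements * 16;
  return vectorSize >= 128 && vectorSize <= regSize;
}
```

Under this model an 8-lane reduction qualifies on any subtarget, while a 16-lane reduction needs at least AVX2. That appears consistent with madd.ll having SSE2 check lines only for the `<8 x i32>` test and not for the `<16 x i32>` one.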

test/CodeGen/X86/madd.ll

This file was added.

				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+sse2 | FileCheck %s --check-prefix=SSE2
				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx2 | FileCheck %s --check-prefix=AVX2
				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx512f | FileCheck %s --check-prefix=AVX512
				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx512bw | FileCheck %s --check-prefix=AVX512

				;SSE2-label: @_Z10test_shortPsS_i
				;SSE2: movdqu
				;SSE2-NEXT: movdqu
				;SSE2-NEXT: pmaddwd
				;SSE2-NEXT: paddd

				;AVX2-label: @_Z10test_shortPsS_i
				;AVX2: vmovdqu
				;AVX2-NEXT: vpmaddwd
				;AVX2-NEXT: vinserti128
				;AVX2-NEXT: vpaddd

				;AVX512-label: @_Z10test_shortPsS_i
				;AVX512: vmovdqu
				;AVX512-NEXT: vpmaddwd
				;AVX512-NEXT: vinserti128
				;AVX512-NEXT: vpaddd

				define i32 @_Z10test_shortPsS_i(i16* nocapture readonly, i16* nocapture readonly, i32) local_unnamed_addr #0 {
				entry:
				Inline comment (wmi, done): The test can be made simpler by removing the preheader and exit, just like test/CodeGen/X86/sad.ll does. That will make it easier to extend the test later to check different element types or vector lengths.
				%3 = zext i32 %2 to i64
				br label %vector.body

				vector.body:
				%index = phi i64 [ %index.next, %vector.body ], [ 0, %entry ]
				%vec.phi = phi <8 x i32> [ %11, %vector.body ], [ zeroinitializer, %entry ]
				%4 = getelementptr inbounds i16, i16* %0, i64 %index
				%5 = bitcast i16* %4 to <8 x i16>*
				%wide.load = load <8 x i16>, <8 x i16>* %5, align 2
				%6 = sext <8 x i16> %wide.load to <8 x i32>
				%7 = getelementptr inbounds i16, i16* %1, i64 %index
				%8 = bitcast i16* %7 to <8 x i16>*
				%wide.load14 = load <8 x i16>, <8 x i16>* %8, align 2
				%9 = sext <8 x i16> %wide.load14 to <8 x i32>
				%10 = mul nsw <8 x i32> %9, %6
				%11 = add nsw <8 x i32> %10, %vec.phi
				%index.next = add i64 %index, 8
				%12 = icmp eq i64 %index.next, %3
				br i1 %12, label %middle.block, label %vector.body

				middle.block:
				%rdx.shuf = shufflevector <8 x i32> %11, <8 x i32> undef, <8 x i32> <i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef>
				%bin.rdx = add <8 x i32> %11, %rdx.shuf
				%rdx.shuf15 = shufflevector <8 x i32> %bin.rdx, <8 x i32> undef, <8 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%bin.rdx16 = add <8 x i32> %bin.rdx, %rdx.shuf15
				%rdx.shuf17 = shufflevector <8 x i32> %bin.rdx16, <8 x i32> undef, <8 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%bin.rdx18 = add <8 x i32> %bin.rdx16, %rdx.shuf17
				%13 = extractelement <8 x i32> %bin.rdx18, i32 0
				ret i32 %13
				}

				;AVX2-label: @_Z9test_charPcS_i
				;AVX2: vpmovsxbw
				;AVX2-NEXT: vpmovsxbw
				;AVX2-NEXT: vpmaddwd
				;AVX2-NEXT: vpaddd

				;AVX512-label: @_Z9test_charPcS_i
				;AVX512: vpmovsxbw
				;AVX512-NEXT: vpmovsxbw
				;AVX512-NEXT: vpmaddwd
				;AVX512-NEXT: vinserti64x4
				;AVX512-NEXT: vpaddd

				define i32 @_Z9test_charPcS_i(i8* nocapture readonly, i8* nocapture readonly, i32) local_unnamed_addr #0 {
				entry:
				%3 = zext i32 %2 to i64
				br label %vector.body

				vector.body:
				%index = phi i64 [ %index.next, %vector.body ], [ 0, %entry ]
				%vec.phi = phi <16 x i32> [ %11, %vector.body ], [ zeroinitializer, %entry ]
				%4 = getelementptr inbounds i8, i8* %0, i64 %index
				%5 = bitcast i8* %4 to <16 x i8>*
				%wide.load = load <16 x i8>, <16 x i8>* %5, align 1
				%6 = sext <16 x i8> %wide.load to <16 x i32>
				%7 = getelementptr inbounds i8, i8* %1, i64 %index
				%8 = bitcast i8* %7 to <16 x i8>*
				%wide.load14 = load <16 x i8>, <16 x i8>* %8, align 1
				%9 = sext <16 x i8> %wide.load14 to <16 x i32>
				%10 = mul nsw <16 x i32> %9, %6
				%11 = add nsw <16 x i32> %10, %vec.phi
				%index.next = add i64 %index, 16
				%12 = icmp eq i64 %index.next, %3
				br i1 %12, label %middle.block, label %vector.body

				middle.block:
				%rdx.shuf = shufflevector <16 x i32> %11, <16 x i32> undef, <16 x i32> <i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%bin.rdx = add <16 x i32> %11, %rdx.shuf
				%rdx.shuf15 = shufflevector <16 x i32> %bin.rdx, <16 x i32> undef, <16 x i32> <i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%bin.rdx16 = add <16 x i32> %bin.rdx, %rdx.shuf15
				%rdx.shuf17 = shufflevector <16 x i32> %bin.rdx16, <16 x i32> undef, <16 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%bin.rdx18 = add <16 x i32> %bin.rdx16, %rdx.shuf17
				%rdx.shuf19 = shufflevector <16 x i32> %bin.rdx18, <16 x i32> undef, <16 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%bin.rdx20 = add <16 x i32> %bin.rdx18, %rdx.shuf19
				%13 = extractelement <16 x i32> %bin.rdx20, i32 0
				ret i32 %13
				}