This is an archive of the discontinued LLVM Phabricator instance.

[X86] Match PSADBW in straight-line code
ClosedPublic

Authored by mkuper on Jul 27 2016, 3:34 PM.

Details

Reviewers

spatel
RKSimon
wmi

Commits

rGf396b4c40dad: [X86] Match PSADBW in straight-line code
rL277219: [X86] Match PSADBW in straight-line code

Summary

Up until now, we only had code to match PSADBW patterns similar to what comes out of the loop vectorizer - a partial reduction inside the loop body that gets fed into a horizontal operation in a different block.
This adds support for straight-line patterns, like those generated by the SLP vectorizer.
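For readers unfamiliar with the instruction, the following is a minimal scalar sketch of what one 64-bit lane of PSADBW computes. The function name `psadbwLane` is hypothetical and exists only for illustration; it is not part of LLVM or of this patch.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical scalar model of one 64-bit lane of PSADBW: the sum of
// |a[i] - b[i]| over eight unsigned bytes.
inline std::uint64_t psadbwLane(const std::array<std::uint8_t, 8> &a,
                                const std::array<std::uint8_t, 8> &b) {
  std::uint64_t sum = 0;
  for (std::size_t i = 0; i < a.size(); ++i) {
    // Widen before subtracting so the difference cannot wrap.
    int diff = static_cast<int>(a[i]) - static_cast<int>(b[i]);
    sum += static_cast<std::uint64_t>(std::abs(diff));
  }
  // Eight byte differences of at most 255 each.
  assert(sum <= 8u * 255u);
  return sum;
}
```

The zext / sub / select sequence the combine looks for is the vector form of the widen, subtract, and absolute-value steps in the loop body.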

Diff Detail

Repository: rL LLVM

Event Timeline

mkuper updated this revision to Diff 65819. Jul 27 2016, 3:34 PM

mkuper retitled this revision from to [X86] Match PSADBW in straight-line code.

mkuper updated this object.

mkuper added reviewers: spatel, RKSimon, wmi.

mkuper added a subscriber: llvm-commits.

wmi added a subscriber: danielcdh. Jul 27 2016, 4:07 PM

Is there much scope to share more of the code with combineLoopSADPattern?

lib/Target/X86/X86ISelLowering.cpp
26461 ↗	(On Diff #65819)	To support wider types, is there any way to split the vector, perform PSAD on both halves, and then combine the two results?
26500 ↗	(On Diff #65819)	Minor, but move this into the for() loop's initializer?
26529 ↗	(On Diff #65819)	Move into for loop

Thanks, Simon!

I tried to share as much code as I could; see https://reviews.llvm.org/rL276798 and https://reviews.llvm.org/rL276918.
I don't see a way to share the "shuffle pyramid" detection code, unfortunately: for the loop case, we are forced to do it in IR, because the pyramid lives in a different basic block.
And the type checks at the beginning didn't look like they were worth factoring out.

If you have suggestions for what I can factor out, let me know, I'll be happy to try.

lib/Target/X86/X86ISelLowering.cpp
26461 ↗	(On Diff #65819)	I think so. Right now we don't do this for the loop version either (I didn't write that code originally :-) ), and I kept the same constraint here. I'd prefer to leave it as a TODO, in both places.
26500 ↗	(On Diff #65819)	Sure (for both loops)
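The splitting idea left as a TODO above can be illustrated with a scalar sketch: compute the SAD of each register-sized chunk separately, then add the partial results. This is only a model of the arithmetic, not the DAG transformation itself; `sadChunk`, `sadSplit`, and the `chunkBytes` parameter are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Hypothetical helper: SAD over n bytes, standing in for one PSADBW plus the
// reduction of its lanes.
inline std::uint32_t sadChunk(const std::uint8_t *a, const std::uint8_t *b,
                              std::size_t n) {
  std::uint32_t sum = 0;
  for (std::size_t i = 0; i < n; ++i)
    sum += static_cast<std::uint32_t>(std::abs(int(a[i]) - int(b[i])));
  return sum;
}

// Split the inputs into chunks of at most chunkBytes (a stand-in for the
// register width), compute each chunk's SAD, then add the partial results.
inline std::uint32_t sadSplit(const std::vector<std::uint8_t> &a,
                              const std::vector<std::uint8_t> &b,
                              std::size_t chunkBytes) {
  assert(a.size() == b.size() && chunkBytes > 0);
  std::uint32_t total = 0;
  for (std::size_t off = 0; off < a.size(); off += chunkBytes) {
    std::size_t n = std::min(chunkBytes, a.size() - off);
    total += sadChunk(a.data() + off, b.data() + off, n);
  }
  return total;
}
```

Because addition is associative, the chunk size does not change the result, which is why splitting a wide vector into several SADs should be safe in principle.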

Another few minor thoughts, I'm happy if you'd prefer to just make these all future TODOs.

lib/Target/X86/X86ISelLowering.cpp
26450 ↗	(On Diff #65819)	Doesn't PSADBW generate unsigned i16 results? IIRC the horizontal sum of absdiff v16i8 / v32i8 / v64i8 should fit into a single i16, correct? Really, there's nothing stopping us from supporting any result integer type >= 16 bits, no?
26461 ↗	(On Diff #65819)	OK - a TODO comment is fine.
26481 ↗	(On Diff #65819)	This looks like it could be useful for general matching of reduction/horizontal ops - possibly pull it out? I have in mind detecting any_of/all_of tests for vector comparison results that I'd like to use MOVMSK for instead - in that case it'd be the same code but we'd match against OR / AND instead of ADD.
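The i16 claim in the comment on line 26450 can be checked with a little arithmetic: each byte pair contributes at most 255 to the sum. The sketch below encodes that bound as compile-time checks; `maxSad` and `MaxI16` are hypothetical names used only here.

```cpp
#include <cstdint>
#include <limits>

// Largest possible sum of absolute differences over NumBytes unsigned bytes:
// each byte pair contributes at most 255.
constexpr std::uint32_t maxSad(std::uint32_t NumBytes) {
  return NumBytes * 255u;
}

constexpr std::uint32_t MaxI16 = std::numeric_limits<std::uint16_t>::max();

static_assert(maxSad(8) == 2040, "one 64-bit PSADBW lane");
static_assert(maxSad(16) <= MaxI16, "v16i8 total fits in 16 bits");
static_assert(maxSad(32) <= MaxI16, "v32i8 total fits in 16 bits");
static_assert(maxSad(64) == 16320 && maxSad(64) <= MaxI16,
              "v64i8 total fits in 16 bits");
```

Under this bound, the full horizontal sum for all three vector widths stays below 65536, which is consistent with the suggestion that any result type of at least 16 bits would work.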

mkuper added inline comments. Jul 29 2016, 10:09 AM

lib/Target/X86/X86ISelLowering.cpp
26450 ↗	(On Diff #65819)	You're right, it's another constraint the original loop code had, and removing it seems trivial. I'll add a TODO and fix this in a separate patch.
26481 ↗	(On Diff #65819)	Sure, I'll pull it out. TBH, I wouldn't be surprised if we already had something similar, but I didn't see it. There's isHorizontalBinOp(), but it's different - it matches the x86 "horizontal" operations (like PHADDD).
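To make the generalization discussed on line 26481 concrete, here is a scalar sketch of a shuffle + binop pyramid with the binary operation as a parameter. It models only the data flow (fold the upper half onto the lower half, then halve the width); it is not the SelectionDAG matcher, and `pyramidReduce` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical scalar model of a shuffle + binop pyramid. At each stage the
// upper half of the live elements is combined into the lower half, which is
// what a shuffle mask such as <4,5,6,7,u,u,u,u> followed by a binop does.
template <typename T, typename BinOp>
T pyramidReduce(std::vector<T> v, BinOp op) {
  // The pyramid only makes sense for a power-of-two element count.
  assert(!v.empty() && (v.size() & (v.size() - 1)) == 0);
  for (std::size_t half = v.size() / 2; half > 0; half /= 2)
    for (std::size_t i = 0; i < half; ++i)
      v[i] = op(v[i], v[half + i]);
  // The reduced value ends up in element 0, matching the extract from index 0.
  return v[0];
}
```

With `std::plus<int>()` this models the ADD reduction used for SAD; with `std::bit_or<int>()` or `std::bit_and<int>()` over all-ones / all-zeros comparison masks it models the any_of / all_of tests mentioned above.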

Updated per Simon's comments.

LGTM

This revision is now accepted and ready to land. Jul 29 2016, 2:28 PM

Closed by commit rL277219: [X86] Match PSADBW in straight-line code (authored by mkuper). Jul 29 2016, 2:53 PM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
llvm/trunk/lib/Target/X86/X86ISelLowering.cpp	137 lines
llvm/trunk/test/CodeGen/X86/sad.ll	309 lines

Diff 66178

llvm/trunk/lib/Target/X86/X86ISelLowering.cpp

[26,272 lines not shown]	if (((Subtarget.hasSSE1() && VT == MVT::f32) ||
SDValue N000 = N0.getOperand(0).getOperand(0);		SDValue N000 = N0.getOperand(0).getOperand(0);
SDValue FPConst = DAG.getBitcast(VT, N0.getOperand(1));		SDValue FPConst = DAG.getBitcast(VT, N0.getOperand(1));
return DAG.getNode(FPOpcode, SDLoc(N0), VT, N000, FPConst);		return DAG.getNode(FPOpcode, SDLoc(N0), VT, N000, FPConst);
}		}

return SDValue();		return SDValue();
}		}


		// Match a binop + shuffle pyramid that represents a horizontal reduction over
		// the elements of a vector.
		// Returns the vector that is being reduced on, or SDValue() if a reduction
		// was not matched.
		static SDValue matchBinOpReduction(SDNode *Extract, ISD::NodeType BinOp) {
		// The pattern must end in an extract from index 0.
		if ((Extract->getOpcode() != ISD::EXTRACT_VECTOR_ELT) ||
		!isNullConstant(Extract->getOperand(1)))
		return SDValue();

		unsigned Stages =
		Log2_32(Extract->getOperand(0).getValueType().getVectorNumElements());

		SDValue Op = Extract->getOperand(0);
		// At each stage, we're looking for something that looks like:
		// %s = shufflevector <8 x i32> %op, <8 x i32> undef,
		// <8 x i32> <i32 2, i32 3, i32 undef, i32 undef,
		// i32 undef, i32 undef, i32 undef, i32 undef>
		// %a = binop <8 x i32> %op, %s
		// Where the mask changes according to the stage. E.g. for a 3-stage pyramid,
		// we expect something like:
		// <4,5,6,7,u,u,u,u>
		// <2,3,u,u,u,u,u,u>
		// <1,u,u,u,u,u,u,u>
		for (unsigned i = 0; i < Stages; ++i) {
		if (Op.getOpcode() != BinOp)
		return SDValue();

		ShuffleVectorSDNode *Shuffle =
		dyn_cast<ShuffleVectorSDNode>(Op.getOperand(0).getNode());
		if (Shuffle) {
		Op = Op.getOperand(1);
		} else {
		Shuffle = dyn_cast<ShuffleVectorSDNode>(Op.getOperand(1).getNode());
		Op = Op.getOperand(0);
		}

		// The first operand of the shuffle should be the same as the other operand
		// of the add.
		if (!Shuffle || (Shuffle->getOperand(0) != Op))
		return SDValue();

		// Verify the shuffle has the expected (at this stage of the pyramid) mask.
		for (int Index = 0, MaskEnd = 1 << i; Index < MaskEnd; ++Index)
		if (Shuffle->getMaskElt(Index) != MaskEnd + Index)
		return SDValue();
		}

		return Op;
		}

// Given a select, detect the following pattern:		// Given a select, detect the following pattern:
// 1: %2 = zext <N x i8> %0 to <N x i32>		// 1: %2 = zext <N x i8> %0 to <N x i32>
// 2: %3 = zext <N x i8> %1 to <N x i32>		// 2: %3 = zext <N x i8> %1 to <N x i32>
// 3: %4 = sub nsw <N x i32> %2, %3		// 3: %4 = sub nsw <N x i32> %2, %3
// 4: %5 = icmp sgt <N x i32> %4, [0 x N] or [-1 x N]		// 4: %5 = icmp sgt <N x i32> %4, [0 x N] or [-1 x N]
// 5: %6 = sub nsw <N x i32> zeroinitializer, %4		// 5: %6 = sub nsw <N x i32> zeroinitializer, %4
// 6: %7 = select <N x i1> %5, <N x i32> %4, <N x i32> %6		// 6: %7 = select <N x i1> %5, <N x i32> %4, <N x i32> %6
// This is useful as it is the input into a SAD pattern.		// This is useful as it is the input into a SAD pattern.
[64 lines not shown]	static SDValue createPSADBW(SelectionDAG &DAG, const SDValue &Zext0,
Ops[0] = Zext1.getOperand(0);		Ops[0] = Zext1.getOperand(0);
SDValue SadOp1 = DAG.getNode(ISD::CONCAT_VECTORS, DL, ExtendedVT, Ops);		SDValue SadOp1 = DAG.getNode(ISD::CONCAT_VECTORS, DL, ExtendedVT, Ops);

// Actually build the SAD		// Actually build the SAD
MVT SadVT = MVT::getVectorVT(MVT::i64, RegSize / 64);		MVT SadVT = MVT::getVectorVT(MVT::i64, RegSize / 64);
return DAG.getNode(X86ISD::PSADBW, DL, SadVT, SadOp0, SadOp1);		return DAG.getNode(X86ISD::PSADBW, DL, SadVT, SadOp0, SadOp1);
}		}

		static SDValue combineBasicSADPattern(SDNode *Extract, SelectionDAG &DAG,
		const X86Subtarget &Subtarget) {
		// PSADBW is only supported on SSE2 and up.
		if (!Subtarget.hasSSE2())
		return SDValue();

		// Verify the type we're extracting from is appropriate
		// TODO: There's nothing special about i32, any integer type above i16 should
		// work just as well.
		EVT VT = Extract->getOperand(0).getValueType();
		if (!VT.isSimple() || !(VT.getVectorElementType() == MVT::i32))
		return SDValue();

		unsigned RegSize = 128;
		if (Subtarget.hasBWI())
		RegSize = 512;
		else if (Subtarget.hasAVX2())
		RegSize = 256;

		// We only handle v16i32 for SSE2 / v32i32 for AVX2 / v64i32 for AVX512.
		// TODO: We should be able to handle larger vectors by splitting them before
		// feeding them into several SADs, and then reducing over those.
		if (VT.getSizeInBits() / 4 > RegSize)
		return SDValue();

		// Match shuffle + add pyramid.
		SDValue Root = matchBinOpReduction(Extract, ISD::ADD);

		// If there was a match, we want Root to be a select that is the root of an
		// abs-diff pattern.
		if (!Root || (Root.getOpcode() != ISD::VSELECT))
		return SDValue();

		// Check whether we have an abs-diff pattern feeding into the select.
		SDValue Zext0, Zext1;
		if (!detectZextAbsDiff(Root, Zext0, Zext1))
		return SDValue();

		// Create the SAD instruction
		SDLoc DL(Extract);
		SDValue SAD = createPSADBW(DAG, Zext0, Zext1, DL);

		// If the original vector was wider than 8 elements, sum over the results
		// in the SAD vector.
		unsigned Stages = Log2_32(VT.getVectorNumElements());
		MVT SadVT = SAD.getSimpleValueType();
		if (Stages > 3) {
		unsigned SadElems = SadVT.getVectorNumElements();

		for(unsigned i = Stages - 3; i > 0; --i) {
		SmallVector<int, 16> Mask(SadElems, -1);
		for(unsigned j = 0, MaskEnd = 1 << (i - 1); j < MaskEnd; ++j)
		Mask[j] = MaskEnd + j;

		SDValue Shuffle =
		DAG.getVectorShuffle(SadVT, DL, SAD, DAG.getUNDEF(SadVT), Mask);
		SAD = DAG.getNode(ISD::ADD, DL, SadVT, SAD, Shuffle);
		}
		}


		// Return the lowest i32.
		MVT ResVT = MVT::getVectorVT(MVT::i32, SadVT.getSizeInBits() / 32);
		SAD = DAG.getNode(ISD::BITCAST, DL, ResVT, SAD);
		return DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, MVT::i32, SAD,
		Extract->getOperand(1));
		}

/// Detect vector gather/scatter index generation and convert it from being a		/// Detect vector gather/scatter index generation and convert it from being a
/// bunch of shuffles and extracts into a somewhat faster sequence.		/// bunch of shuffles and extracts into a somewhat faster sequence.
/// For i686, the best sequence is apparently storing the value and loading		/// For i686, the best sequence is apparently storing the value and loading
/// scalars back, while for x64 we should use 64-bit extracts and shifts.		/// scalars back, while for x64 we should use 64-bit extracts and shifts.
static SDValue combineExtractVectorElt(SDNode *N, SelectionDAG &DAG,		static SDValue combineExtractVectorElt(SDNode *N, SelectionDAG &DAG,
TargetLowering::DAGCombinerInfo &DCI) {		TargetLowering::DAGCombinerInfo &DCI,
		const X86Subtarget &Subtarget) {
if (SDValue NewOp = XFormVExtractWithShuffleIntoLoad(N, DAG, DCI))		if (SDValue NewOp = XFormVExtractWithShuffleIntoLoad(N, DAG, DCI))
return NewOp;		return NewOp;

SDValue InputVector = N->getOperand(0);		SDValue InputVector = N->getOperand(0);
SDLoc dl(InputVector);		SDLoc dl(InputVector);
// Detect mmx to i32 conversion through a v2i32 elt extract.		// Detect mmx to i32 conversion through a v2i32 elt extract.
if (InputVector.getOpcode() == ISD::BITCAST && InputVector.hasOneUse() &&		if (InputVector.getOpcode() == ISD::BITCAST && InputVector.hasOneUse() &&
N->getValueType(0) == MVT::i32 &&		N->getValueType(0) == MVT::i32 &&
[14 lines not shown]	if (VT == MVT::i1 && isa<ConstantSDNode>(N->getOperand(1)) &&
isa<ConstantSDNode>(InputVector.getOperand(0))) {		isa<ConstantSDNode>(InputVector.getOperand(0))) {
uint64_t ExtractedElt =		uint64_t ExtractedElt =
cast<ConstantSDNode>(N->getOperand(1))->getZExtValue();		cast<ConstantSDNode>(N->getOperand(1))->getZExtValue();
uint64_t InputValue =		uint64_t InputValue =
cast<ConstantSDNode>(InputVector.getOperand(0))->getZExtValue();		cast<ConstantSDNode>(InputVector.getOperand(0))->getZExtValue();
uint64_t Res = (InputValue >> ExtractedElt) & 1;		uint64_t Res = (InputValue >> ExtractedElt) & 1;
return DAG.getConstant(Res, dl, MVT::i1);		return DAG.getConstant(Res, dl, MVT::i1);
}		}

		// Check whether this extract is the root of a sum of absolute differences
		// pattern. This has to be done here because we really want it to happen
		// pre-legalization.
		if (SDValue SAD = combineBasicSADPattern(N, DAG, Subtarget))
		return SAD;

// Only operate on vectors of 4 elements, where the alternative shuffling		// Only operate on vectors of 4 elements, where the alternative shuffling
// gets to be more expensive.		// gets to be more expensive.
if (InputVector.getValueType() != MVT::v4i32)		if (InputVector.getValueType() != MVT::v4i32)
return SDValue();		return SDValue();

// Check whether every use of InputVector is an EXTRACT_VECTOR_ELT with a		// Check whether every use of InputVector is an EXTRACT_VECTOR_ELT with a
// single use which is a sign-extend or zero-extend, and all elements are		// single use which is a sign-extend or zero-extend, and all elements are
// used.		// used.
[4,320 lines not shown]

static SDValue combineLoopSADPattern(SDNode *N, SelectionDAG &DAG,		static SDValue combineLoopSADPattern(SDNode *N, SelectionDAG &DAG,
const X86Subtarget &Subtarget) {		const X86Subtarget &Subtarget) {
SDLoc DL(N);		SDLoc DL(N);
EVT VT = N->getValueType(0);		EVT VT = N->getValueType(0);
SDValue Op0 = N->getOperand(0);		SDValue Op0 = N->getOperand(0);
SDValue Op1 = N->getOperand(1);		SDValue Op1 = N->getOperand(1);

		// TODO: There's nothing special about i32, any integer type above i16 should
		// work just as well.
if (!VT.isVector() || !VT.isSimple() ||		if (!VT.isVector() || !VT.isSimple() ||
!(VT.getVectorElementType() == MVT::i32))		!(VT.getVectorElementType() == MVT::i32))
return SDValue();		return SDValue();

unsigned RegSize = 128;		unsigned RegSize = 128;
if (Subtarget.hasBWI())		if (Subtarget.hasBWI())
RegSize = 512;		RegSize = 512;
else if (Subtarget.hasAVX2())		else if (Subtarget.hasAVX2())
RegSize = 256;		RegSize = 256;

// We only handle v16i32 for SSE2 / v32i32 for AVX2 / v64i32 for AVX512.		// We only handle v16i32 for SSE2 / v32i32 for AVX2 / v64i32 for AVX512.
		// TODO: We should be able to handle larger vectors by splitting them before
		// feeding them into several SADs, and then reducing over those.
if (VT.getSizeInBits() / 4 > RegSize)		if (VT.getSizeInBits() / 4 > RegSize)
return SDValue();		return SDValue();

// We know N is a reduction add, which means one of its operands is a phi.		// We know N is a reduction add, which means one of its operands is a phi.
// To match SAD, we need the other operand to be a vector select.		// To match SAD, we need the other operand to be a vector select.
SDValue SelectOp, Phi;		SDValue SelectOp, Phi;
if (Op0.getOpcode() == ISD::VSELECT) {		if (Op0.getOpcode() == ISD::VSELECT) {
SelectOp = Op0;		SelectOp = Op0;
[221 lines not shown]
}		}


SDValue X86TargetLowering::PerformDAGCombine(SDNode *N,		SDValue X86TargetLowering::PerformDAGCombine(SDNode *N,
DAGCombinerInfo &DCI) const {		DAGCombinerInfo &DCI) const {
SelectionDAG &DAG = DCI.DAG;		SelectionDAG &DAG = DCI.DAG;
switch (N->getOpcode()) {		switch (N->getOpcode()) {
default: break;		default: break;
case ISD::EXTRACT_VECTOR_ELT: return combineExtractVectorElt(N, DAG, DCI);		case ISD::EXTRACT_VECTOR_ELT:
		return combineExtractVectorElt(N, DAG, DCI, Subtarget);
case ISD::VSELECT:		case ISD::VSELECT:
case ISD::SELECT:		case ISD::SELECT:
case X86ISD::SHRUNKBLEND: return combineSelect(N, DAG, DCI, Subtarget);		case X86ISD::SHRUNKBLEND: return combineSelect(N, DAG, DCI, Subtarget);
case ISD::BITCAST: return combineBitcast(N, DAG, Subtarget);		case ISD::BITCAST: return combineBitcast(N, DAG, Subtarget);
case X86ISD::CMOV: return combineCMov(N, DAG, DCI, Subtarget);		case X86ISD::CMOV: return combineCMov(N, DAG, DCI, Subtarget);
case ISD::ADD: return combineAdd(N, DAG, Subtarget);		case ISD::ADD: return combineAdd(N, DAG, Subtarget);
case ISD::SUB: return combineSub(N, DAG, Subtarget);		case ISD::SUB: return combineSub(N, DAG, Subtarget);
case X86ISD::ADC: return combineADC(N, DAG, DCI);		case X86ISD::ADC: return combineADC(N, DAG, DCI);
[969 lines not shown]

llvm/trunk/test/CodeGen/X86/sad.ll

	; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
	; NOTE: Assertions have been autogenerated by update_llc_test_checks.py
	; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+sse2 | FileCheck %s --check-prefix=SSE2			; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+sse2 | FileCheck %s --check-prefix=SSE2
	; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx2 | FileCheck %s --check-prefix=AVX2			; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx2 | FileCheck %s --check-prefix=AVX2
	; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx512f | FileCheck %s --check-prefix=AVX512F			; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx512f | FileCheck %s --check-prefix=AVX512F
	; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx512bw | FileCheck %s --check-prefix=AVX512BW			; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx512bw | FileCheck %s --check-prefix=AVX512BW

	@a = global [1024 x i8] zeroinitializer, align 16			@a = global [1024 x i8] zeroinitializer, align 16
	@b = global [1024 x i8] zeroinitializer, align 16			@b = global [1024 x i8] zeroinitializer, align 16

	[983 lines not shown]
	middle.block:			middle.block:
	%.lcssa = phi <2 x i32> [ %10, %vector.body ]			%.lcssa = phi <2 x i32> [ %10, %vector.body ]
	%rdx.shuf = shufflevector <2 x i32> %.lcssa, <2 x i32> undef, <2 x i32> <i32 1, i32 undef>			%rdx.shuf = shufflevector <2 x i32> %.lcssa, <2 x i32> undef, <2 x i32> <i32 1, i32 undef>
	%bin.rdx = add <2 x i32> %.lcssa, %rdx.shuf			%bin.rdx = add <2 x i32> %.lcssa, %rdx.shuf
	%12 = extractelement <2 x i32> %bin.rdx, i32 0			%12 = extractelement <2 x i32> %bin.rdx, i32 0
	ret i32 %12			ret i32 %12
	}			}

				define i32 @sad_nonloop_4i8(<4 x i8>* nocapture readonly %p, i64, <4 x i8>* nocapture readonly %q) local_unnamed_addr #0 {
				; SSE2-LABEL: sad_nonloop_4i8:
				; SSE2: # BB#0:
				; SSE2-NEXT: movd {{.*#+}} xmm0 = mem[0],zero,zero,zero
				; SSE2-NEXT: movd {{.*#+}} xmm1 = mem[0],zero,zero,zero
				; SSE2-NEXT: psadbw %xmm0, %xmm1
				; SSE2-NEXT: movd %xmm1, %eax
				; SSE2-NEXT: retq
				;
				; AVX2-LABEL: sad_nonloop_4i8:
				; AVX2: # BB#0:
				; AVX2-NEXT: vmovd {{.*#+}} xmm0 = mem[0],zero,zero,zero
				; AVX2-NEXT: vmovd {{.*#+}} xmm1 = mem[0],zero,zero,zero
				; AVX2-NEXT: vpsadbw %xmm0, %xmm1, %xmm0
				; AVX2-NEXT: vmovd %xmm0, %eax
				; AVX2-NEXT: retq
				;
				; AVX512F-LABEL: sad_nonloop_4i8:
				; AVX512F: # BB#0:
				; AVX512F-NEXT: vmovd {{.*#+}} xmm0 = mem[0],zero,zero,zero
				; AVX512F-NEXT: vmovd {{.*#+}} xmm1 = mem[0],zero,zero,zero
				; AVX512F-NEXT: vpsadbw %xmm0, %xmm1, %xmm0
				; AVX512F-NEXT: vmovd %xmm0, %eax
				; AVX512F-NEXT: retq
				;
				; AVX512BW-LABEL: sad_nonloop_4i8:
				; AVX512BW: # BB#0:
				; AVX512BW-NEXT: vmovd {{.*#+}} xmm0 = mem[0],zero,zero,zero
				; AVX512BW-NEXT: vmovd {{.*#+}} xmm1 = mem[0],zero,zero,zero
				; AVX512BW-NEXT: vpsadbw %xmm0, %xmm1, %xmm0
				; AVX512BW-NEXT: vmovd %xmm0, %eax
				; AVX512BW-NEXT: retq
				%v1 = load <4 x i8>, <4 x i8>* %p, align 1
				%z1 = zext <4 x i8> %v1 to <4 x i32>
				%v2 = load <4 x i8>, <4 x i8>* %q, align 1
				%z2 = zext <4 x i8> %v2 to <4 x i32>
				%sub = sub nsw <4 x i32> %z1, %z2
				%isneg = icmp sgt <4 x i32> %sub, <i32 -1, i32 -1, i32 -1, i32 -1>
				%neg = sub nsw <4 x i32> zeroinitializer, %sub
				%abs = select <4 x i1> %isneg, <4 x i32> %sub, <4 x i32> %neg
				%h2 = shufflevector <4 x i32> %abs, <4 x i32> undef, <4 x i32> <i32 2, i32 3, i32 undef, i32 undef>
				%sum2 = add <4 x i32> %abs, %h2
				%h3 = shufflevector <4 x i32> %sum2, <4 x i32> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
				%sum3 = add <4 x i32> %sum2, %h3
				%sum = extractelement <4 x i32> %sum3, i32 0
				ret i32 %sum
				}

				define i32 @sad_nonloop_8i8(<8 x i8>* nocapture readonly %p, i64, <8 x i8>* nocapture readonly %q) local_unnamed_addr #0 {
				; SSE2-LABEL: sad_nonloop_8i8:
				; SSE2: # BB#0:
				; SSE2-NEXT: movq {{.*#+}} xmm0 = mem[0],zero
				; SSE2-NEXT: movq {{.*#+}} xmm1 = mem[0],zero
				; SSE2-NEXT: psadbw %xmm0, %xmm1
				; SSE2-NEXT: movd %xmm1, %eax
				; SSE2-NEXT: retq
				;
				; AVX2-LABEL: sad_nonloop_8i8:
				; AVX2: # BB#0:
				; AVX2-NEXT: vmovq {{.*#+}} xmm0 = mem[0],zero
				; AVX2-NEXT: vmovq {{.*#+}} xmm1 = mem[0],zero
				; AVX2-NEXT: vpsadbw %xmm0, %xmm1, %xmm0
				; AVX2-NEXT: vmovd %xmm0, %eax
				; AVX2-NEXT: retq
				;
				; AVX512F-LABEL: sad_nonloop_8i8:
				; AVX512F: # BB#0:
				; AVX512F-NEXT: vmovq {{.*#+}} xmm0 = mem[0],zero
				; AVX512F-NEXT: vmovq {{.*#+}} xmm1 = mem[0],zero
				; AVX512F-NEXT: vpsadbw %xmm0, %xmm1, %xmm0
				; AVX512F-NEXT: vmovd %xmm0, %eax
				; AVX512F-NEXT: retq
				;
				; AVX512BW-LABEL: sad_nonloop_8i8:
				; AVX512BW: # BB#0:
				; AVX512BW-NEXT: vmovq {{.*#+}} xmm0 = mem[0],zero
				; AVX512BW-NEXT: vmovq {{.*#+}} xmm1 = mem[0],zero
				; AVX512BW-NEXT: vpsadbw %xmm0, %xmm1, %xmm0
				; AVX512BW-NEXT: vmovd %xmm0, %eax
				; AVX512BW-NEXT: retq
				%v1 = load <8 x i8>, <8 x i8>* %p, align 1
				%z1 = zext <8 x i8> %v1 to <8 x i32>
				%v2 = load <8 x i8>, <8 x i8>* %q, align 1
				%z2 = zext <8 x i8> %v2 to <8 x i32>
				%sub = sub nsw <8 x i32> %z1, %z2
				%isneg = icmp sgt <8 x i32> %sub, <i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1>
				%neg = sub nsw <8 x i32> zeroinitializer, %sub
				%abs = select <8 x i1> %isneg, <8 x i32> %sub, <8 x i32> %neg
				%h1 = shufflevector <8 x i32> %abs, <8 x i32> undef, <8 x i32> <i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef>
				%sum1 = add <8 x i32> %abs, %h1
				%h2 = shufflevector <8 x i32> %sum1, <8 x i32> undef, <8 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%sum2 = add <8 x i32> %sum1, %h2
				%h3 = shufflevector <8 x i32> %sum2, <8 x i32> undef, <8 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%sum3 = add <8 x i32> %sum2, %h3
				%sum = extractelement <8 x i32> %sum3, i32 0
				ret i32 %sum
				}

				define i32 @sad_nonloop_16i8(<16 x i8>* nocapture readonly %p, i64, <16 x i8>* nocapture readonly %q) local_unnamed_addr #0 {
				; SSE2-LABEL: sad_nonloop_16i8:
				; SSE2: # BB#0:
				; SSE2-NEXT: movdqu (%rdi), %xmm0
				; SSE2-NEXT: movdqu (%rdx), %xmm1
				; SSE2-NEXT: psadbw %xmm0, %xmm1
				; SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm1[2,3,0,1]
				; SSE2-NEXT: paddq %xmm1, %xmm0
				; SSE2-NEXT: movd %xmm0, %eax
				; SSE2-NEXT: retq
				;
				; AVX2-LABEL: sad_nonloop_16i8:
				; AVX2: # BB#0:
				; AVX2-NEXT: vmovdqu (%rdi), %xmm0
				; AVX2-NEXT: vpsadbw (%rdx), %xmm0, %xmm0
				; AVX2-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]
				; AVX2-NEXT: vpaddq %xmm1, %xmm0, %xmm0
				; AVX2-NEXT: vmovd %xmm0, %eax
				; AVX2-NEXT: retq
				;
				; AVX512F-LABEL: sad_nonloop_16i8:
				; AVX512F: # BB#0:
				; AVX512F-NEXT: vmovdqu (%rdi), %xmm0
				; AVX512F-NEXT: vpsadbw (%rdx), %xmm0, %xmm0
				; AVX512F-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]
				; AVX512F-NEXT: vpaddq %xmm1, %xmm0, %xmm0
				; AVX512F-NEXT: vmovd %xmm0, %eax
				; AVX512F-NEXT: retq
				;
				; AVX512BW-LABEL: sad_nonloop_16i8:
				; AVX512BW: # BB#0:
				; AVX512BW-NEXT: vmovdqu (%rdi), %xmm0
				; AVX512BW-NEXT: vpsadbw (%rdx), %xmm0, %xmm0
				; AVX512BW-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]
				; AVX512BW-NEXT: vpaddq %xmm1, %xmm0, %xmm0
				; AVX512BW-NEXT: vmovd %xmm0, %eax
				; AVX512BW-NEXT: retq
				%v1 = load <16 x i8>, <16 x i8>* %p, align 1
				%z1 = zext <16 x i8> %v1 to <16 x i32>
				%v2 = load <16 x i8>, <16 x i8>* %q, align 1
				%z2 = zext <16 x i8> %v2 to <16 x i32>
				%sub = sub nsw <16 x i32> %z1, %z2
				%isneg = icmp sgt <16 x i32> %sub, <i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1>
				%neg = sub nsw <16 x i32> zeroinitializer, %sub
				%abs = select <16 x i1> %isneg, <16 x i32> %sub, <16 x i32> %neg
				%h0 = shufflevector <16 x i32> %abs, <16 x i32> undef, <16 x i32> <i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%sum0 = add <16 x i32> %abs, %h0
				%h1 = shufflevector <16 x i32> %sum0, <16 x i32> undef, <16 x i32> <i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%sum1 = add <16 x i32> %sum0, %h1
				%h2 = shufflevector <16 x i32> %sum1, <16 x i32> undef, <16 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%sum2 = add <16 x i32> %sum1, %h2
				%h3 = shufflevector <16 x i32> %sum2, <16 x i32> undef, <16 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%sum3 = add <16 x i32> %sum2, %h3
				%sum = extractelement <16 x i32> %sum3, i32 0
				ret i32 %sum
				}

				define i32 @sad_nonloop_32i8(<32 x i8>* nocapture readonly %p, i64, <32 x i8>* nocapture readonly %q) local_unnamed_addr #0 {
				; SSE2-LABEL: sad_nonloop_32i8:
				; SSE2: # BB#0:
				; SSE2-NEXT: movdqu (%rdi), %xmm12
				; SSE2-NEXT: movdqu 16(%rdi), %xmm2
				; SSE2-NEXT: pshufd {{.*#+}} xmm13 = xmm2[2,3,0,1]
				; SSE2-NEXT: pxor %xmm5, %xmm5
				; SSE2-NEXT: punpcklbw {{.*#+}} xmm13 = xmm13[0],xmm5[0],xmm13[1],xmm5[1],xmm13[2],xmm5[2],xmm13[3],xmm5[3],xmm13[4],xmm5[4],xmm13[5],xmm5[5],xmm13[6],xmm5[6],xmm13[7],xmm5[7]
				; SSE2-NEXT: movdqa %xmm13, %xmm9
				; SSE2-NEXT: punpckhwd {{.*#+}} xmm9 = xmm9[4],xmm5[4],xmm9[5],xmm5[5],xmm9[6],xmm5[6],xmm9[7],xmm5[7]
				; SSE2-NEXT: pshufd {{.*#+}} xmm1 = xmm12[2,3,0,1]
				; SSE2-NEXT: punpcklbw {{.*#+}} xmm1 = xmm1[0],xmm5[0],xmm1[1],xmm5[1],xmm1[2],xmm5[2],xmm1[3],xmm5[3],xmm1[4],xmm5[4],xmm1[5],xmm5[5],xmm1[6],xmm5[6],xmm1[7],xmm5[7]
				; SSE2-NEXT: movdqa %xmm1, %xmm3
				; SSE2-NEXT: punpckhwd {{.*#+}} xmm3 = xmm3[4],xmm5[4],xmm3[5],xmm5[5],xmm3[6],xmm5[6],xmm3[7],xmm5[7]
				; SSE2-NEXT: punpcklbw {{.*#+}} xmm2 = xmm2[0],xmm5[0],xmm2[1],xmm5[1],xmm2[2],xmm5[2],xmm2[3],xmm5[3],xmm2[4],xmm5[4],xmm2[5],xmm5[5],xmm2[6],xmm5[6],xmm2[7],xmm5[7]
				; SSE2-NEXT: movdqa %xmm2, %xmm10
				; SSE2-NEXT: punpckhwd {{.*#+}} xmm10 = xmm10[4],xmm5[4],xmm10[5],xmm5[5],xmm10[6],xmm5[6],xmm10[7],xmm5[7]
				; SSE2-NEXT: punpcklbw {{.*#+}} xmm12 = xmm12[0],xmm5[0],xmm12[1],xmm5[1],xmm12[2],xmm5[2],xmm12[3],xmm5[3],xmm12[4],xmm5[4],xmm12[5],xmm5[5],xmm12[6],xmm5[6],xmm12[7],xmm5[7]
				; SSE2-NEXT: movdqa %xmm12, %xmm11
				; SSE2-NEXT: punpckhwd {{.*#+}} xmm11 = xmm11[4],xmm5[4],xmm11[5],xmm5[5],xmm11[6],xmm5[6],xmm11[7],xmm5[7]
				; SSE2-NEXT: punpcklwd {{.*#+}} xmm13 = xmm13[0],xmm5[0],xmm13[1],xmm5[1],xmm13[2],xmm5[2],xmm13[3],xmm5[3]
				; SSE2-NEXT: punpcklwd {{.*#+}} xmm1 = xmm1[0],xmm5[0],xmm1[1],xmm5[1],xmm1[2],xmm5[2],xmm1[3],xmm5[3]
				; SSE2-NEXT: punpcklwd {{.*#+}} xmm2 = xmm2[0],xmm5[0],xmm2[1],xmm5[1],xmm2[2],xmm5[2],xmm2[3],xmm5[3]
				; SSE2-NEXT: punpcklwd {{.*#+}} xmm12 = xmm12[0],xmm5[0],xmm12[1],xmm5[1],xmm12[2],xmm5[2],xmm12[3],xmm5[3]
				; SSE2-NEXT: movdqu (%rdx), %xmm7
				; SSE2-NEXT: movdqu 16(%rdx), %xmm0
				; SSE2-NEXT: pshufd {{.*#+}} xmm6 = xmm0[2,3,0,1]
				; SSE2-NEXT: punpcklbw {{.*#+}} xmm6 = xmm6[0],xmm5[0],xmm6[1],xmm5[1],xmm6[2],xmm5[2],xmm6[3],xmm5[3],xmm6[4],xmm5[4],xmm6[5],xmm5[5],xmm6[6],xmm5[6],xmm6[7],xmm5[7]
				; SSE2-NEXT: movdqa %xmm6, %xmm4
				; SSE2-NEXT: punpckhwd {{.*#+}} xmm4 = xmm4[4],xmm5[4],xmm4[5],xmm5[5],xmm4[6],xmm5[6],xmm4[7],xmm5[7]
				; SSE2-NEXT: movdqa %xmm4, -{{[0-9]+}}(%rsp) # 16-byte Spill
				; SSE2-NEXT: pshufd {{.*#+}} xmm4 = xmm7[2,3,0,1]
				; SSE2-NEXT: punpcklbw {{.*#+}} xmm4 = xmm4[0],xmm5[0],xmm4[1],xmm5[1],xmm4[2],xmm5[2],xmm4[3],xmm5[3],xmm4[4],xmm5[4],xmm4[5],xmm5[5],xmm4[6],xmm5[6],xmm4[7],xmm5[7]
				; SSE2-NEXT: movdqa %xmm4, %xmm14
				; SSE2-NEXT: punpckhwd {{.*#+}} xmm14 = xmm14[4],xmm5[4],xmm14[5],xmm5[5],xmm14[6],xmm5[6],xmm14[7],xmm5[7]
				; SSE2-NEXT: punpcklbw {{.*#+}} xmm0 = xmm0[0],xmm5[0],xmm0[1],xmm5[1],xmm0[2],xmm5[2],xmm0[3],xmm5[3],xmm0[4],xmm5[4],xmm0[5],xmm5[5],xmm0[6],xmm5[6],xmm0[7],xmm5[7]
				; SSE2-NEXT: movdqa %xmm0, %xmm15
				; SSE2-NEXT: punpckhwd {{.*#+}} xmm15 = xmm15[4],xmm5[4],xmm15[5],xmm5[5],xmm15[6],xmm5[6],xmm15[7],xmm5[7]
				; SSE2-NEXT: punpcklbw {{.*#+}} xmm7 = xmm7[0],xmm5[0],xmm7[1],xmm5[1],xmm7[2],xmm5[2],xmm7[3],xmm5[3],xmm7[4],xmm5[4],xmm7[5],xmm5[5],xmm7[6],xmm5[6],xmm7[7],xmm5[7]
				; SSE2-NEXT: movdqa %xmm7, %xmm8
				; SSE2-NEXT: punpckhwd {{.*#+}} xmm8 = xmm8[4],xmm5[4],xmm8[5],xmm5[5],xmm8[6],xmm5[6],xmm8[7],xmm5[7]
				; SSE2-NEXT: punpcklwd {{.*#+}} xmm6 = xmm6[0],xmm5[0],xmm6[1],xmm5[1],xmm6[2],xmm5[2],xmm6[3],xmm5[3]
				; SSE2-NEXT: punpcklwd {{.*#+}} xmm4 = xmm4[0],xmm5[0],xmm4[1],xmm5[1],xmm4[2],xmm5[2],xmm4[3],xmm5[3]
				; SSE2-NEXT: punpcklwd {{.*#+}} xmm0 = xmm0[0],xmm5[0],xmm0[1],xmm5[1],xmm0[2],xmm5[2],xmm0[3],xmm5[3]
				; SSE2-NEXT: punpcklwd {{.*#+}} xmm7 = xmm7[0],xmm5[0],xmm7[1],xmm5[1],xmm7[2],xmm5[2],xmm7[3],xmm5[3]
				; SSE2-NEXT: psubd %xmm7, %xmm12
				; SSE2-NEXT: psubd %xmm0, %xmm2
				; SSE2-NEXT: psubd %xmm4, %xmm1
				; SSE2-NEXT: psubd %xmm6, %xmm13
				; SSE2-NEXT: psubd %xmm8, %xmm11
				; SSE2-NEXT: psubd %xmm15, %xmm10
				; SSE2-NEXT: psubd %xmm14, %xmm3
				; SSE2-NEXT: psubd -{{[0-9]+}}(%rsp), %xmm9 # 16-byte Folded Reload
				; SSE2-NEXT: movdqa %xmm9, %xmm0
				; SSE2-NEXT: psrad $31, %xmm0
				; SSE2-NEXT: paddd %xmm0, %xmm9
				; SSE2-NEXT: pxor %xmm0, %xmm9
				; SSE2-NEXT: movdqa %xmm3, %xmm0
				; SSE2-NEXT: psrad $31, %xmm0
				; SSE2-NEXT: paddd %xmm0, %xmm3
				; SSE2-NEXT: pxor %xmm0, %xmm3
				; SSE2-NEXT: movdqa %xmm10, %xmm0
				; SSE2-NEXT: psrad $31, %xmm0
				; SSE2-NEXT: paddd %xmm0, %xmm10
				; SSE2-NEXT: pxor %xmm0, %xmm10
				; SSE2-NEXT: movdqa %xmm11, %xmm0
				; SSE2-NEXT: psrad $31, %xmm0
				; SSE2-NEXT: paddd %xmm0, %xmm11
				; SSE2-NEXT: pxor %xmm0, %xmm11
				; SSE2-NEXT: movdqa %xmm13, %xmm0
				; SSE2-NEXT: psrad $31, %xmm0
				; SSE2-NEXT: paddd %xmm0, %xmm13
				; SSE2-NEXT: pxor %xmm0, %xmm13
				; SSE2-NEXT: movdqa %xmm1, %xmm0
				; SSE2-NEXT: psrad $31, %xmm0
				; SSE2-NEXT: paddd %xmm0, %xmm1
				; SSE2-NEXT: pxor %xmm0, %xmm1
				; SSE2-NEXT: movdqa %xmm2, %xmm0
				; SSE2-NEXT: psrad $31, %xmm0
				; SSE2-NEXT: paddd %xmm0, %xmm2
				; SSE2-NEXT: pxor %xmm0, %xmm2
				; SSE2-NEXT: movdqa %xmm12, %xmm0
				; SSE2-NEXT: psrad $31, %xmm0
				; SSE2-NEXT: paddd %xmm0, %xmm12
				; SSE2-NEXT: pxor %xmm0, %xmm12
				; SSE2-NEXT: paddd %xmm13, %xmm1
				; SSE2-NEXT: paddd %xmm9, %xmm3
				; SSE2-NEXT: paddd %xmm10, %xmm3
				; SSE2-NEXT: paddd %xmm11, %xmm3
				; SSE2-NEXT: paddd %xmm2, %xmm1
				; SSE2-NEXT: paddd %xmm3, %xmm1
				; SSE2-NEXT: paddd %xmm12, %xmm1
				; SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm1[2,3,0,1]
				; SSE2-NEXT: paddd %xmm1, %xmm0
				; SSE2-NEXT: pshufd {{.*#+}} xmm1 = xmm0[1,1,2,3]
				; SSE2-NEXT: paddd %xmm0, %xmm1
				; SSE2-NEXT: movd %xmm1, %eax
				; SSE2-NEXT: retq
				;
				; AVX2-LABEL: sad_nonloop_32i8:
				; AVX2: # BB#0:
				; AVX2-NEXT: vmovdqu (%rdi), %ymm0
				; AVX2-NEXT: vpsadbw (%rdx), %ymm0, %ymm0
				; AVX2-NEXT: vextracti128 $1, %ymm0, %xmm1
				; AVX2-NEXT: vpaddq %ymm1, %ymm0, %ymm0
				; AVX2-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]
				; AVX2-NEXT: vpaddq %ymm1, %ymm0, %ymm0
				; AVX2-NEXT: vmovd %xmm0, %eax
				; AVX2-NEXT: vzeroupper
				; AVX2-NEXT: retq
				;
				; AVX512F-LABEL: sad_nonloop_32i8:
				; AVX512F: # BB#0:
				; AVX512F-NEXT: vmovdqu (%rdi), %ymm0
				; AVX512F-NEXT: vpsadbw (%rdx), %ymm0, %ymm0
				; AVX512F-NEXT: vextracti128 $1, %ymm0, %xmm1
				; AVX512F-NEXT: vpaddq %ymm1, %ymm0, %ymm0
				; AVX512F-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]
				; AVX512F-NEXT: vpaddq %ymm1, %ymm0, %ymm0
				; AVX512F-NEXT: vmovd %xmm0, %eax
				; AVX512F-NEXT: retq
				;
				; AVX512BW-LABEL: sad_nonloop_32i8:
				; AVX512BW: # BB#0:
				; AVX512BW-NEXT: vmovdqu (%rdi), %ymm0
				; AVX512BW-NEXT: vpsadbw (%rdx), %ymm0, %ymm0
				; AVX512BW-NEXT: vextracti128 $1, %ymm0, %xmm1
				; AVX512BW-NEXT: vpaddq %ymm1, %ymm0, %ymm0
				; AVX512BW-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]
				; AVX512BW-NEXT: vpaddq %ymm1, %ymm0, %ymm0
				; AVX512BW-NEXT: vmovd %xmm0, %eax
				; AVX512BW-NEXT: retq
				%v1 = load <32 x i8>, <32 x i8>* %p, align 1
				%z1 = zext <32 x i8> %v1 to <32 x i32>
				%v2 = load <32 x i8>, <32 x i8>* %q, align 1
				%z2 = zext <32 x i8> %v2 to <32 x i32>
				%sub = sub nsw <32 x i32> %z1, %z2
				%isneg = icmp sgt <32 x i32> %sub, <i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1>
				%neg = sub nsw <32 x i32> zeroinitializer, %sub
				%abs = select <32 x i1> %isneg, <32 x i32> %sub, <32 x i32> %neg
				%h32 = shufflevector <32 x i32> %abs, <32 x i32> undef, <32 x i32> <i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%sum32 = add <32 x i32> %abs, %h32
				%h0 = shufflevector <32 x i32> %sum32, <32 x i32> undef, <32 x i32> <i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%sum0 = add <32 x i32> %sum32, %h0
				%h1 = shufflevector <32 x i32> %sum0, <32 x i32> undef, <32 x i32> <i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%sum1 = add <32 x i32> %sum0, %h1
				%h2 = shufflevector <32 x i32> %sum1, <32 x i32> undef, <32 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%sum2 = add <32 x i32> %sum1, %h2
				%h3 = shufflevector <32 x i32> %sum2, <32 x i32> undef, <32 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%sum3 = add <32 x i32> %sum2, %h3
				%sum = extractelement <32 x i32> %sum3, i32 0
				ret i32 %sum
				}