This is an archive of the discontinued LLVM Phabricator instance.

[X86] Detect SAD patterns and emit psadbw instructions on X86.
ClosedPublic

Authored by congh on Nov 19 2015, 1:03 PM.

Details

Reviewers

RKSimon
davidxl
hfinkel

Commits

rG6f879d9eb1a1: Detects the SAD pattern on X86 so that much better code will be emitted once…
rL267649: Detects the SAD pattern on X86 so that much better code will be emitted once…

Summary

As we now can detect vectorized reduction operations, it's time to use this feature! This patch detects the SAD pattern on X86 so that much better code will be emitted once the pattern is matched.
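
For context, a SAD (sum of absolute differences) reduction has roughly the following scalar shape. This is only a minimal sketch: `sadScalar` is a hypothetical name, and the exact source used to produce the test IR is not shown in this review. Once vectorized, a loop like this is expected to yield the zext / sub / icmp / select / add sequence that the patch matches.

```cpp
#include <cstddef>
#include <cstdint>

// Scalar sum of absolute differences over two byte arrays.
// Hypothetical helper for illustration; not part of the patch.
std::uint32_t sadScalar(const std::uint8_t *a, const std::uint8_t *b,
                        std::size_t n) {
  std::uint32_t sum = 0;
  for (std::size_t i = 0; i < n; ++i) {
    // Widen to int first so the subtraction cannot wrap in 8 bits.
    int diff = static_cast<int>(a[i]) - static_cast<int>(b[i]);
    sum += static_cast<std::uint32_t>(diff < 0 ? -diff : diff);
  }
  return sum;
}
```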

Diff Detail

Repository: rL LLVM

Event Timeline

congh updated this revision to Diff 40699. Nov 19 2015, 1:03 PM

congh retitled this revision from to Detect SAD patterns and emit psadbw instructions on X86..

congh updated this object.

Use Metadata to annotate reduction PHI.

congh added a reviewer: davidxl. Nov 19 2015, 10:00 PM

Is it ready to be reviewed upstream?

David

I am splitting this patch and a subset is now under review
(http://reviews.llvm.org/D14897). After that patch is landed, I will
send this patch to upstream for review.

thanks,
Cong

Update the patch after part of it is checked in.

Upload the correct patch.

Update the patch.

congh retitled this revision from Detect SAD patterns and emit psadbw instructions on X86. to [X86] Detect SAD patterns and emit psadbw instructions on X86.. Feb 16 2016, 11:12 PM

congh updated this object.

congh added reviewers: hfinkel, RKSimon.

congh added a subscriber: llvm-commits.

Ping?

ab added a subscriber: ab. Feb 25 2016, 3:10 PM

ab added inline comments.

lib/Target/X86/X86ISelLowering.cpp
28588–28600 ↗	(On Diff #48154)	Perhaps you could replace this with BuildVectorSDNode::getConstantSplatNode()? ..but looking at the uses, you can replace them all with isBuildVectorAllOnes/isBuildVectorAllZeros.
28613 ↗	(On Diff #48154)	All vector MVTs have power-of-2 NumElems, so I think you can drop the second part (and NumElems, and merge the ifs).
28616–28623 ↗	(On Diff #48154)	What happens on v2i32? Also, doesn't this require SSE2? I'd just explicitly check subtarget and type, like we do elsewhere. So, something like: if ((VT == MVT::v16i32 && !Subtarget.hasAVX512()) || (VT == MVT::v8i32 && !Subtarget.hasAVX2()) || (VT == MVT::v4i32 && !Subtarget.hasSSE2())) return SDValue();
28697–28701 ↗	(On Diff #48154)	Why is this necessary? It seems to me that this is forcing RegSize to 128 (because even v16i8, which requires AVX-512 as it's added as v16i32 has a size of 128). In turn, that forces XMM-sized PSADs to be generated, even on AVX2. Am I missing something?
28740–28742 ↗	(On Diff #48154)	if (SDValue Sad = detectSADPattern(N, DAG, Subtarget)) return Sad;
test/CodeGen/X86/sad.ll
1–6 ↗	(On Diff #48154)	IMHO you should only test the minimal IR with llc here. I see the argument for testing the whole thing, but it seems like 1) that prevents you from testing unrelated constructs (not all IR comes from the vectorizer), and 2) there should already be relevant opt tests. If this fires on the test-suite (IIRC paq8p could benefit from this?), then people will notice regressions. If it doesn't, is there something we could add to the test-suite that exposes this? Also, I don't think you need -mcpu.

Thanks for the review! Please check my inline comments.

lib/Target/X86/X86ISelLowering.cpp
28588–28600 ↗	(On Diff #48154)	I also need to check if a vector contains all -1s, which could not be satisfied by isBuildVectorAllOnes/isBuildVectorAllZeros. How to use getConstantSplatNode here?
28616–28623 ↗	(On Diff #48154)	I think the case of v2i32 may not be generated by the vectorizer but as we are not just handling the results of auto-vectorization, I think it is better to cover this case. For v16i32 we only need SSE2 as we are actually handling v16i8, and the result will be collected as a v2i64. We need AVX512 to handle v64i32. Therefore, my previous check here is incorrect, which should be if (VT.getSizeInBits() / 4 > RegSize) return SDValue();
28697–28701 ↗	(On Diff #48154)	v16i8 is actually handled by SSE2, while AVX-512 is handling v64i8. This means even on AVX512 we should still use SSE2 instructions to handle v16i8.
test/CodeGen/X86/sad.ll
1–6 ↗	(On Diff #48154)	I just want to avoid writing the test case for different targets and hence the test code size is minimized. But I agree with you on that we should not rely on auto-vectorization for the test of this patch. So the minimal IR here would be several huge functions for different targets?
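
The width check proposed in the reply above can be modeled with plain integers. This is a sketch under stated assumptions: `fitsInOnePsadbw` is a hypothetical name, and the register sizes are passed in directly rather than read from the subtarget.

```cpp
// Models the bail-out `VT.getSizeInBits() / 4 > RegSize` from the discussion.
// numElems is the element count N of the <N x i32> reduction type;
// regSizeBits is the widest usable vector register (128, 256 or 512).
// Hypothetical helper for illustration; not LLVM API.
bool fitsInOnePsadbw(unsigned numElems, unsigned regSizeBits) {
  unsigned vtBits = numElems * 32;  // size of <N x i32>
  unsigned inputBits = vtBits / 4;  // size of the matching <N x i8> input
  return inputBits <= regSizeBits;  // the patch bails out when this is false
}
```

With this model, v16i32 (a v16i8 input) fits in a 128-bit register, while v64i32 needs a 512-bit one, which matches the reasoning in the reply.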

Update the patch according to Ahmed's comments.

ab added inline comments. Feb 26 2016, 4:42 PM

lib/Target/X86/X86ISelLowering.cpp
28751–28763 ↗	(On Diff #49267)	Right, isBuildVectorAllOnes looks for -1 (not 1), as it checks that "all of the elements are ~0 or undef."
test/CodeGen/X86/sad.ll
2–7 ↗	(On Diff #49267)	Does it need to be huge? After all, this is only testing the PSAD ISel, and the pattern seems pretty minimal; would it work to test variants of your example in detectSADPattern?
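
As a side note on why the matcher can accept either a 0 splat or a -1 splat as the right-hand side of the signed greater-than: both select forms compute the same absolute value, because the only input on which the two conditions disagree is 0, and both arms of the select yield 0 there. A minimal scalar sketch (the function names are hypothetical):

```cpp
#include <cstdint>

// select(x > 0, x, 0 - x)
std::int32_t absViaGtZero(std::int32_t x) { return x > 0 ? x : 0 - x; }

// select(x > -1, x, 0 - x)
std::int32_t absViaGtMinusOne(std::int32_t x) { return x > -1 ? x : 0 - x; }
```

The inputs of interest are differences of two zero-extended bytes, so they lie in [-255, 255] and the negation cannot overflow.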

Update the patch according to Ahmed's comments.

Both comments addressed. PTAL.

lib/Target/X86/X86ISelLowering.cpp
28751–28763 ↗	(On Diff #49267)	You are right. Updated. Thanks!
test/CodeGen/X86/sad.ll
2–7 ↗	(On Diff #49267)	I need to compose the test so that the reduction vector operations can be detected. But I think I can still write small tests. To test on different targets, I have to split the test file into three for testing sse2/avx2/avx512, as I found llc cannot parse the IR generated by opt -mattr=+avx512bw on AVX2.

The checks on the tests are quite poor - I understand that update_llc_test_checks.py output might be too much but please see if you can expand the checks to give a better idea of context.

lib/Target/X86/X86ISelLowering.cpp
28751–28763 ↗	(On Diff #49424)	512-bit PSAD requires AVX512BW (Subtarget.hasBWI()) - plain AVX512 can't handle 512-bit vector elements smaller than 32-bits - please ensure the tests run with avx512f (which will use the AVX2 path) as well as avx512bw.
test/CodeGen/X86/sad.ll
2–7 ↗	(On Diff #49267)	This sounds like a bug - the test case has no target specific intrinsics so we should be able to handle it on every target.

Simon and Ahmed seem to have a good handle on this, just wanted to say...

In D14840#369592, @RKSimon wrote:

The checks on the tests are quite poor - I understand that update_llc_test_checks.py output might be too much but please see if you can expand the checks to give a better idea of context.

I would strongly encourage folks to use update_llc_test_checks.py. If it's not working here, we should make a version that does work. The resulting test accuracy is just such a huge improvement.

In D14840#369663, @chandlerc wrote:

Simon and Ahmed seem to have a good handle on this, just wanted to say...

In D14840#369592, @RKSimon wrote:

The checks on the tests are quite poor - I understand that update_llc_test_checks.py output might be too much but please see if you can expand the checks to give a better idea of context.

I would strongly encourage folks to use update_llc_test_checks.py. If it's not working here, we should make a version that does work. The resulting test accuracy is just such a huge improvement.

I have updated the test file by running update_llc_test_checks.py. Thanks!

lib/Target/X86/X86ISelLowering.cpp
28751–28763 ↗	(On Diff #49424)	You are right! I have updated the test to cover both AVX512BW and AVX512F. Thanks!
test/CodeGen/X86/sad.ll
2–7 ↗	(On Diff #49267)	Yes, this is a bug. When I tried to compile sad-avx2.ll with -mattr=+avx512bw, I got an internal error. The bug is in DAGCombiner::visitINSERT_SUBVECTOR, which tries to convert an INSERT_SUBVECTOR operation into a CONCAT_VECTORS one. We need to check there that the two operands of CONCAT_VECTORS have the same type. This is fixed in this patch, so I could then combine those three tests into one file.

Update the patch according to Simon and Chandler's comments.

What would be necessary to enable PSADBW to match for SSE2 in the 32i8/64i8 cases (and AVX2/AVX512F in the 64i8 case)?

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
13643 ↗	(On Diff #50367)	We're referencing N->getOperand(1) enough now that we can bring this out as N1.
13650 ↗	(On Diff #50367)	Add this condition to the outer if()
lib/Target/X86/X86ISelLowering.cpp
29046 ↗	(On Diff #50367)	implemented

In D14840#373057, @RKSimon wrote:

What would be necessary to enable PSADBW to match for SSE2 in the 32i8/64i8 cases (and AVX2/AVX512F in the 64i8 case)?

Then we need to handle this case in detectSADPattern: currently we don't handle vectors that are too wide for the available registers. This is fine from an auto-vectorization standpoint, as we won't get such wide vectors, so I am wondering if it is worth the effort to handle those cases. If we want to do it, we need to teach X86 isel how to split X86ISD::PSADBW. What do you think?
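
To illustrate why splitting is semantically straightforward, here is a small model of PSADBW, in which every group of 8 byte pairs independently produces one 64-bit sum. This is a sketch, not the isel change being discussed; `psadbwModel` and `psadbwSplitModel` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Emulates PSADBW: each group of 8 byte pairs yields one 64-bit sum of
// absolute differences. Hypothetical model for illustration only.
std::vector<std::uint64_t> psadbwModel(const std::vector<std::uint8_t> &a,
                                       const std::vector<std::uint8_t> &b) {
  assert(a.size() == b.size() && a.size() % 8 == 0);
  std::vector<std::uint64_t> out(a.size() / 8, 0);
  for (std::size_t i = 0; i < a.size(); ++i) {
    int d = static_cast<int>(a[i]) - static_cast<int>(b[i]);
    out[i / 8] += static_cast<std::uint64_t>(d < 0 ? -d : d);
  }
  return out;
}

// Performs the same operation as two half-width operations and
// concatenates the results (low half first).
std::vector<std::uint64_t> psadbwSplitModel(const std::vector<std::uint8_t> &a,
                                            const std::vector<std::uint8_t> &b) {
  assert(a.size() == b.size() && a.size() % 16 == 0);
  const auto half = static_cast<std::ptrdiff_t>(a.size() / 2);
  std::vector<std::uint8_t> aLo(a.begin(), a.begin() + half);
  std::vector<std::uint8_t> aHi(a.begin() + half, a.end());
  std::vector<std::uint8_t> bLo(b.begin(), b.begin() + half);
  std::vector<std::uint8_t> bHi(b.begin() + half, b.end());
  std::vector<std::uint64_t> out = psadbwModel(aLo, bLo);
  const std::vector<std::uint64_t> hi = psadbwModel(aHi, bHi);
  out.insert(out.end(), hi.begin(), hi.end());
  return out;
}
```

Because the 8-byte groups never interact, the wide and split forms agree on every input; the open question in the thread is only how to teach the backend to perform that split.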

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
13643 ↗	(On Diff #50367)	Done.
13650 ↗	(On Diff #50367)	OK.
lib/Target/X86/X86ISelLowering.cpp
29046 ↗	(On Diff #50367)	Done.

Update the patch according to Simon's comments.

Ping?

Minor cleanup request.

Also, can you get rid of the stack instructions in the test (nounwind?)

lib/Target/X86/X86ISelLowering.cpp
29090 ↗	(On Diff #50490)	DAG.getConstant(0, DL, InVT)

In D14840#378899, @RKSimon wrote:

Minor cleanup request.

Also, can you get rid of the stack instructions in the test (nounwind?)

OK. Done. Thanks!

Update the patch according to Simon's comments.

So, I understand the huge tests are required to trigger the reduction recognizer. Have you considered changing that, for the sake of testability? The most obvious thing I can think of is to represent the flag as metadata in IR. That has other benefits:

do the analysis in IR rather than in SelectionDAGBuilder, which is already a big enough mess as it is ;)
only do the analysis for targets that use it: if only X86 can select reduction ops, what's the point in recognizing them elsewhere?

Another (possibly even crappier) alternative: add some "stress-test" commandline flag that overrides the isVectorReduction check on ADDs (or even adds the flag everywhere).

In D14840#385786, @ab wrote:

So, I understand the huge tests are required to trigger the reduction recognizer. Have you considered changing that, for the sake of testability? The most obvious thing I can think of is to represent the flag as metadata in IR. That has other benefits:

do the analysis in IR rather than in SelectionDAGBuilder, which is already a big enough mess as it is ;)

Doing this early means that we need to presuppose what the backend legalization will do, understand target features, etc. The fact that it is hard to test backend components should be addressed by improving those components, not by moving things into the IR level unnecessarily. There are a few places where we do this kind of thing out of necessity (because, for example, we lack sufficient loop analysis capabilities in the backend; PPCCTRLoops is an example), and it's a mess. You're certainly right, however, that SelectionDAGBuilder could use some refactoring (or at least partitioning). It might never happen, however, before the entire thing gets replaced by GlobalISel ;)

only do the analysis for targets that use it: if only X86 can select reduction ops, what's the point in recognizing them elsewhere?

Other targets will use this as well. PowerPC has vsumsws and friends. AArch64 has vaddv, etc.

Another (possibly even crappier) alternative: add some "stress-test" commandline flag that overrides the isVectorReduction check on ADDs (or even adds the flag everywhere).

Honestly, the test cases don't look that bad. The IR, at least, is fairly succinct. It's never going to be super-small because we're matching a non-trivial pattern. The CHECK lines are definitely over-constraining, but that's a symptom of the way they're autogenerated. I don't actually like tests like this which seem to strongly cover irrelevant register-allocation decisions, but that's a larger problem with many tests in the backend. Hand-coded tests making proper use of CHECK-DAG and named regex patterns for allocated registers would be smaller.

LGTM.

This revision is now accepted and ready to land. Apr 26 2016, 5:42 PM

Closed by commit rL267649: Detects the SAD pattern on X86 so that much better code will be emitted once… (authored by conghou). Apr 26 2016, 6:35 PM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
llvm/trunk/lib/CodeGen/SelectionDAG/DAGCombiner.cpp	18 lines
llvm/trunk/lib/Target/X86/X86ISelLowering.cpp	136 lines
llvm/trunk/test/CodeGen/X86/sad.ll	973 lines

Diff 55151

llvm/trunk/lib/CodeGen/SelectionDAG/DAGCombiner.cpp

		if (InVal.getOpcode() == ISD::EXTRACT_VECTOR_ELT) {
}		}
}		}

return SDValue();		return SDValue();
}		}

SDValue DAGCombiner::visitINSERT_SUBVECTOR(SDNode *N) {		SDValue DAGCombiner::visitINSERT_SUBVECTOR(SDNode *N) {
SDValue N0 = N->getOperand(0);		SDValue N0 = N->getOperand(0);
		SDValue N1 = N->getOperand(1);
SDValue N2 = N->getOperand(2);		SDValue N2 = N->getOperand(2);

		if (N0.getValueType() != N1.getValueType())
		return SDValue();

// If the input vector is a concatenation, and the insert replaces		// If the input vector is a concatenation, and the insert replaces
// one of the halves, we can optimize into a single concat_vectors.		// one of the halves, we can optimize into a single concat_vectors.
if (N0.getOpcode() == ISD::CONCAT_VECTORS &&		if (N0.getOpcode() == ISD::CONCAT_VECTORS && N0->getNumOperands() == 2 &&
N0->getNumOperands() == 2 && N2.getOpcode() == ISD::Constant) {		N2.getOpcode() == ISD::Constant) {
APInt InsIdx = cast<ConstantSDNode>(N2)->getAPIntValue();		APInt InsIdx = cast<ConstantSDNode>(N2)->getAPIntValue();
EVT VT = N->getValueType(0);		EVT VT = N->getValueType(0);

// Lower half: fold (insert_subvector (concat_vectors X, Y), Z) ->		// Lower half: fold (insert_subvector (concat_vectors X, Y), Z) ->
// (concat_vectors Z, Y)		// (concat_vectors Z, Y)
if (InsIdx == 0)		if (InsIdx == 0)
return DAG.getNode(ISD::CONCAT_VECTORS, SDLoc(N), VT,		return DAG.getNode(ISD::CONCAT_VECTORS, SDLoc(N), VT, N1,
N->getOperand(1), N0.getOperand(1));		N0.getOperand(1));

// Upper half: fold (insert_subvector (concat_vectors X, Y), Z) ->		// Upper half: fold (insert_subvector (concat_vectors X, Y), Z) ->
// (concat_vectors X, Z)		// (concat_vectors X, Z)
if (InsIdx == VT.getVectorNumElements()/2)		if (InsIdx == VT.getVectorNumElements() / 2)
return DAG.getNode(ISD::CONCAT_VECTORS, SDLoc(N), VT,		return DAG.getNode(ISD::CONCAT_VECTORS, SDLoc(N), VT, N0.getOperand(0),
N0.getOperand(0), N->getOperand(1));		N1);
}		}

return SDValue();		return SDValue();
}		}

SDValue DAGCombiner::visitFP_TO_FP16(SDNode *N) {		SDValue DAGCombiner::visitFP_TO_FP16(SDNode *N) {
SDValue N0 = N->getOperand(0);		SDValue N0 = N->getOperand(0);

llvm/trunk/lib/Target/X86/X86ISelLowering.cpp

		return DAG.getNode(N->getOpcode() == ISD::SUB ? X86ISD::ADC : X86ISD::SBB,
DL, OtherVal.getValueType(), OtherVal,		DL, OtherVal.getValueType(), OtherVal,
DAG.getConstant(-1ULL, DL, OtherVal.getValueType()),		DAG.getConstant(-1ULL, DL, OtherVal.getValueType()),
NewCmp);		NewCmp);
return DAG.getNode(N->getOpcode() == ISD::SUB ? X86ISD::SBB : X86ISD::ADC,		return DAG.getNode(N->getOpcode() == ISD::SUB ? X86ISD::SBB : X86ISD::ADC,
DL, OtherVal.getValueType(), OtherVal,		DL, OtherVal.getValueType(), OtherVal,
DAG.getConstant(0, DL, OtherVal.getValueType()), NewCmp);		DAG.getConstant(0, DL, OtherVal.getValueType()), NewCmp);
}		}

		static SDValue detectSADPattern(SDNode *N, SelectionDAG &DAG,
		const X86Subtarget &Subtarget) {
		SDLoc DL(N);
		EVT VT = N->getValueType(0);
		SDValue Op0 = N->getOperand(0);
		SDValue Op1 = N->getOperand(1);

		if (!VT.isVector() || !VT.isSimple() ||
		!(VT.getVectorElementType() == MVT::i32))
		return SDValue();

		unsigned RegSize = 128;
		if (Subtarget.hasBWI())
		RegSize = 512;
		else if (Subtarget.hasAVX2())
		RegSize = 256;

		// We only handle v16i32 for SSE2 / v32i32 for AVX2 / v64i32 for AVX512.
		if (VT.getSizeInBits() / 4 > RegSize)
		return SDValue();

		// Detect the following pattern:
		//
		// 1: %2 = zext <N x i8> %0 to <N x i32>
		// 2: %3 = zext <N x i8> %1 to <N x i32>
		// 3: %4 = sub nsw <N x i32> %2, %3
		// 4: %5 = icmp sgt <N x i32> %4, [0 x N] or [-1 x N]
		// 5: %6 = sub nsw <N x i32> zeroinitializer, %4
		// 6: %7 = select <N x i1> %5, <N x i32> %4, <N x i32> %6
		// 7: %8 = add nsw <N x i32> %7, %vec.phi
		//
		// The last instruction must be a reduction add. The instructions 3-6 form an
		// ABSDIFF pattern.

		// The two operands of reduction add are from PHI and a select-op as in line 7
		// above.
		SDValue SelectOp, Phi;
		if (Op0.getOpcode() == ISD::VSELECT) {
		SelectOp = Op0;
		Phi = Op1;
		} else if (Op1.getOpcode() == ISD::VSELECT) {
		SelectOp = Op1;
		Phi = Op0;
		} else
		return SDValue();

		// Check the condition of the select instruction is greater-than.
		SDValue SetCC = SelectOp->getOperand(0);
		if (SetCC.getOpcode() != ISD::SETCC)
		return SDValue();
		ISD::CondCode CC = cast<CondCodeSDNode>(SetCC.getOperand(2))->get();
		if (CC != ISD::SETGT)
		return SDValue();

		Op0 = SelectOp->getOperand(1);
		Op1 = SelectOp->getOperand(2);

		// The second operand of SelectOp Op1 is the negation of the first operand
		// Op0, which is implemented as 0 - Op0.
		if (!(Op1.getOpcode() == ISD::SUB &&
		ISD::isBuildVectorAllZeros(Op1.getOperand(0).getNode()) &&
		Op1.getOperand(1) == Op0))
		return SDValue();

		// The first operand of SetCC is the first operand of SelectOp, which is the
		// difference between two input vectors.
		if (SetCC.getOperand(0) != Op0)
		return SDValue();

		// The second operand of > comparison can be either -1 or 0.
		if (!(ISD::isBuildVectorAllZeros(SetCC.getOperand(1).getNode()) ||
		ISD::isBuildVectorAllOnes(SetCC.getOperand(1).getNode())))
		return SDValue();

		// The first operand of SelectOp is the difference between two input vectors.
		if (Op0.getOpcode() != ISD::SUB)
		return SDValue();

		Op1 = Op0.getOperand(1);
		Op0 = Op0.getOperand(0);

		// Check if the operands of the diff are zero-extended from vectors of i8.
		if (Op0.getOpcode() != ISD::ZERO_EXTEND ||
		Op0.getOperand(0).getValueType().getVectorElementType() != MVT::i8 ||
		Op1.getOpcode() != ISD::ZERO_EXTEND ||
		Op1.getOperand(0).getValueType().getVectorElementType() != MVT::i8)
		return SDValue();

		// SAD pattern detected. Now build a SAD instruction and an addition for
		// reduction. Note that the number of elements of the result of SAD is less
		// than the number of elements of its input. Therefore, we could only update
		// part of elements in the reduction vector.

		// Legalize the type of the inputs of PSADBW.
		EVT InVT = Op0.getOperand(0).getValueType();
		if (InVT.getSizeInBits() <= 128)
		RegSize = 128;
		else if (InVT.getSizeInBits() <= 256)
		RegSize = 256;

		unsigned NumConcat = RegSize / InVT.getSizeInBits();
		SmallVector<SDValue, 16> Ops(NumConcat, DAG.getConstant(0, DL, InVT));
		Ops[0] = Op0.getOperand(0);
		MVT ExtendedVT = MVT::getVectorVT(MVT::i8, RegSize / 8);
		Op0 = DAG.getNode(ISD::CONCAT_VECTORS, DL, ExtendedVT, Ops);
		Ops[0] = Op1.getOperand(0);
		Op1 = DAG.getNode(ISD::CONCAT_VECTORS, DL, ExtendedVT, Ops);

		// The output of PSADBW is a vector of i64.
		MVT SadVT = MVT::getVectorVT(MVT::i64, RegSize / 64);
		SDValue Sad = DAG.getNode(X86ISD::PSADBW, DL, SadVT, Op0, Op1);

		// We need to turn the vector of i64 into a vector of i32.
		MVT ResVT = MVT::getVectorVT(MVT::i32, RegSize / 32);
		Sad = DAG.getNode(ISD::BITCAST, DL, ResVT, Sad);

		NumConcat = VT.getSizeInBits() / ResVT.getSizeInBits();
		if (NumConcat > 1) {
		// Update part of elements of the reduction vector. This is done by first
		// extracting a sub-vector from it, updating this sub-vector, and inserting
		// it back.
		SDValue SubPhi = DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, ResVT, Phi,
		DAG.getIntPtrConstant(0, DL));
		SDValue Res = DAG.getNode(ISD::ADD, DL, ResVT, Sad, SubPhi);
		return DAG.getNode(ISD::INSERT_SUBVECTOR, DL, VT, Phi, Res,
		DAG.getIntPtrConstant(0, DL));
		} else
		return DAG.getNode(ISD::ADD, DL, VT, Sad, Phi);
		}

static SDValue combineAdd(SDNode *N, SelectionDAG &DAG,		static SDValue combineAdd(SDNode *N, SelectionDAG &DAG,
const X86Subtarget &Subtarget) {		const X86Subtarget &Subtarget) {
		const SDNodeFlags *Flags = &cast<BinaryWithFlagsSDNode>(N)->Flags;
		if (Flags->hasVectorReduction()) {
		if (SDValue Sad = detectSADPattern(N, DAG, Subtarget))
		return Sad;
		}

EVT VT = N->getValueType(0);		EVT VT = N->getValueType(0);
SDValue Op0 = N->getOperand(0);		SDValue Op0 = N->getOperand(0);
SDValue Op1 = N->getOperand(1);		SDValue Op1 = N->getOperand(1);

// Try to synthesize horizontal adds from adds of shuffles.		// Try to synthesize horizontal adds from adds of shuffles.
if (((Subtarget.hasSSSE3() && (VT == MVT::v8i16 || VT == MVT::v4i32)) ||		if (((Subtarget.hasSSSE3() && (VT == MVT::v8i16 || VT == MVT::v4i32)) ||
(Subtarget.hasInt256() && (VT == MVT::v16i16 || VT == MVT::v8i32))) &&		(Subtarget.hasInt256() && (VT == MVT::v16i16 || VT == MVT::v8i32))) &&
isHorizontalBinOp(Op0, Op1, true))		isHorizontalBinOp(Op0, Op1, true))
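
The comment in detectSADPattern notes that the SAD result has fewer elements than its input, so only part of the reduction vector is updated. The sketch below models one loop iteration for the v16i8 / v16i32 case to show why that preserves the final reduced value. All names are hypothetical, and the lane layout assumes the little-endian bitcast of v2i64 to v4i32 used on x86.

```cpp
#include <array>
#include <cstdint>
#include <numeric>

using Bytes16 = std::array<std::uint8_t, 16>;
using Acc16 = std::array<std::uint32_t, 16>;

static std::uint32_t absDiff(std::uint8_t x, std::uint8_t y) {
  return static_cast<std::uint32_t>(x > y ? x - y : y - x);
}

// Original form: every i32 lane accumulates one absolute difference.
void stepLanewise(Acc16 &acc, const Bytes16 &a, const Bytes16 &b) {
  for (int i = 0; i < 16; ++i)
    acc[i] += absDiff(a[i], b[i]);
}

// Rewritten form: PSADBW yields two i64 sums; viewed as four i32 lanes the
// sums sit in lanes 0 and 2 (each sum is at most 8 * 255, so the upper
// halves in lanes 1 and 3 are zero). Only accumulator lanes 0..3 change.
void stepPsadbw(Acc16 &acc, const Bytes16 &a, const Bytes16 &b) {
  std::uint32_t sad[4] = {0, 0, 0, 0};
  for (int i = 0; i < 16; ++i)
    sad[(i / 8) * 2] += absDiff(a[i], b[i]);
  for (int i = 0; i < 4; ++i)
    acc[i] += sad[i];
}

// Models the horizontal reduction done in the middle block.
std::uint32_t horizontalSum(const Acc16 &acc) {
  return std::accumulate(acc.begin(), acc.end(), 0u);
}
```

The individual lanes differ between the two forms, but the horizontal sum is the same, which is the only value the reduction exposes. This is also why the transformation is gated on the add being flagged as a vector reduction.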

llvm/trunk/test/CodeGen/X86/sad.ll

				; NOTE: Assertions have been autogenerated by update_llc_test_checks.py
				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+sse2 | FileCheck %s --check-prefix=SSE2
				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx2 | FileCheck %s --check-prefix=AVX2
				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx512f | FileCheck %s --check-prefix=AVX512F
				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx512bw | FileCheck %s --check-prefix=AVX512BW

				@a = global [1024 x i8] zeroinitializer, align 16
				@b = global [1024 x i8] zeroinitializer, align 16

				define i32 @sad_16i8() nounwind {
				; SSE2-LABEL: sad_16i8:
				; SSE2: # BB#0: # %entry
				; SSE2-NEXT: pushq %rbp
				; SSE2-NEXT: movq %rsp, %rbp
				; SSE2-NEXT: andq $-64, %rsp
				; SSE2-NEXT: subq $128, %rsp
				; SSE2-NEXT: pxor %xmm0, %xmm0
				; SSE2-NEXT: movq $-1024, %rax # imm = 0xFFFFFFFFFFFFFC00
				; SSE2-NEXT: pxor %xmm1, %xmm1
				; SSE2-NEXT: pxor %xmm3, %xmm3
				; SSE2-NEXT: pxor %xmm2, %xmm2
				; SSE2-NEXT: .p2align 4, 0x90
				; SSE2-NEXT: .LBB0_1: # %vector.body
				; SSE2-NEXT: # =>This Inner Loop Header: Depth=1
				; SSE2-NEXT: movdqa %xmm0, %xmm4
				; SSE2-NEXT: movdqu a+1024(%rax), %xmm5
				; SSE2-NEXT: movdqu b+1024(%rax), %xmm0
				; SSE2-NEXT: movdqa %xmm4, (%rsp)
				; SSE2-NEXT: movdqa %xmm1, {{[0-9]+}}(%rsp)
				; SSE2-NEXT: movdqa %xmm3, {{[0-9]+}}(%rsp)
				; SSE2-NEXT: movdqa %xmm2, {{[0-9]+}}(%rsp)
				; SSE2-NEXT: psadbw %xmm5, %xmm0
				; SSE2-NEXT: paddd %xmm4, %xmm0
				; SSE2-NEXT: movdqa %xmm0, (%rsp)
				; SSE2-NEXT: movdqa {{[0-9]+}}(%rsp), %xmm1
				; SSE2-NEXT: movdqa {{[0-9]+}}(%rsp), %xmm3
				; SSE2-NEXT: movdqa {{[0-9]+}}(%rsp), %xmm2
				; SSE2-NEXT: addq $4, %rax
				; SSE2-NEXT: jne .LBB0_1
				; SSE2-NEXT: # BB#2: # %middle.block
				; SSE2-NEXT: paddd %xmm3, %xmm0
				; SSE2-NEXT: paddd %xmm2, %xmm1
				; SSE2-NEXT: paddd %xmm0, %xmm1
				; SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm1[2,3,0,1]
				; SSE2-NEXT: paddd %xmm1, %xmm0
				; SSE2-NEXT: pshufd {{.*#+}} xmm1 = xmm0[1,1,2,3]
				; SSE2-NEXT: paddd %xmm0, %xmm1
				; SSE2-NEXT: movd %xmm1, %eax
				; SSE2-NEXT: movq %rbp, %rsp
				; SSE2-NEXT: popq %rbp
				; SSE2-NEXT: retq
				;
				; AVX2-LABEL: sad_16i8:
				; AVX2: # BB#0: # %entry
				; AVX2-NEXT: pushq %rbp
				; AVX2-NEXT: movq %rsp, %rbp
				; AVX2-NEXT: andq $-64, %rsp
				; AVX2-NEXT: subq $128, %rsp
				; AVX2-NEXT: vpxor %ymm0, %ymm0, %ymm0
				; AVX2-NEXT: movq $-1024, %rax # imm = 0xFFFFFFFFFFFFFC00
				; AVX2-NEXT: vpxor %ymm1, %ymm1, %ymm1
				; AVX2-NEXT: .p2align 4, 0x90
				; AVX2-NEXT: .LBB0_1: # %vector.body
				; AVX2-NEXT: # =>This Inner Loop Header: Depth=1
				; AVX2-NEXT: vmovdqu a+1024(%rax), %xmm2
				; AVX2-NEXT: vmovdqa %ymm0, (%rsp)
				; AVX2-NEXT: vmovdqa %ymm1, {{[0-9]+}}(%rsp)
				; AVX2-NEXT: vpsadbw b+1024(%rax), %xmm2, %xmm1
				; AVX2-NEXT: vpaddd %xmm0, %xmm1, %xmm0
				; AVX2-NEXT: vmovdqa %xmm0, (%rsp)
				; AVX2-NEXT: vmovdqa (%rsp), %ymm0
				; AVX2-NEXT: vmovdqa {{[0-9]+}}(%rsp), %ymm1
				; AVX2-NEXT: addq $4, %rax
				; AVX2-NEXT: jne .LBB0_1
				; AVX2-NEXT: # BB#2: # %middle.block
				; AVX2-NEXT: vpaddd %ymm1, %ymm0, %ymm0
				; AVX2-NEXT: vextracti128 $1, %ymm0, %xmm1
				; AVX2-NEXT: vpaddd %ymm1, %ymm0, %ymm0
				; AVX2-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]
				; AVX2-NEXT: vpaddd %ymm1, %ymm0, %ymm0
				; AVX2-NEXT: vphaddd %ymm0, %ymm0, %ymm0
				; AVX2-NEXT: vmovd %xmm0, %eax
				; AVX2-NEXT: movq %rbp, %rsp
				; AVX2-NEXT: popq %rbp
				; AVX2-NEXT: vzeroupper
				; AVX2-NEXT: retq
				;
				; AVX512F-LABEL: sad_16i8:
				; AVX512F: # BB#0: # %entry
				; AVX512F-NEXT: vpxord %zmm0, %zmm0, %zmm0
				; AVX512F-NEXT: movq $-1024, %rax # imm = 0xFFFFFFFFFFFFFC00
				; AVX512F-NEXT: .p2align 4, 0x90
				; AVX512F-NEXT: .LBB0_1: # %vector.body
				; AVX512F-NEXT: # =>This Inner Loop Header: Depth=1
				; AVX512F-NEXT: vmovdqu a+1024(%rax), %xmm1
				; AVX512F-NEXT: vpsadbw b+1024(%rax), %xmm1, %xmm1
				; AVX512F-NEXT: vpaddd %xmm0, %xmm1, %xmm1
				; AVX512F-NEXT: vinserti32x4 $0, %xmm1, %zmm0, %zmm0
				; AVX512F-NEXT: addq $4, %rax
				; AVX512F-NEXT: jne .LBB0_1
				; AVX512F-NEXT: # BB#2: # %middle.block
				; AVX512F-NEXT: vshufi64x2 {{.*#+}} zmm1 = zmm0[4,5,6,7,0,1,0,1]
				; AVX512F-NEXT: vpaddd %zmm1, %zmm0, %zmm0
				; AVX512F-NEXT: vshufi64x2 {{.*#+}} zmm1 = zmm0[2,3,0,1,0,1,0,1]
				; AVX512F-NEXT: vpaddd %zmm1, %zmm0, %zmm0
				; AVX512F-NEXT: vpunpckhqdq {{.*#+}} zmm1 = zmm0[1,1,3,3,5,5,7,7]
				; AVX512F-NEXT: vpaddd %zmm1, %zmm0, %zmm0
				; AVX512F-NEXT: movl $1, %eax
				; AVX512F-NEXT: vmovd %eax, %xmm1
				; AVX512F-NEXT: vpermd %zmm0, %zmm1, %zmm1
				; AVX512F-NEXT: vpaddd %zmm1, %zmm0, %zmm0
				; AVX512F-NEXT: vmovd %xmm0, %eax
				; AVX512F-NEXT: retq
				;
				; AVX512BW-LABEL: sad_16i8:
				; AVX512BW: # BB#0: # %entry
				; AVX512BW-NEXT: vpxord %zmm0, %zmm0, %zmm0
				; AVX512BW-NEXT: movq $-1024, %rax # imm = 0xFFFFFFFFFFFFFC00
				; AVX512BW-NEXT: .p2align 4, 0x90
				; AVX512BW-NEXT: .LBB0_1: # %vector.body
				; AVX512BW-NEXT: # =>This Inner Loop Header: Depth=1
				; AVX512BW-NEXT: vmovdqu a+1024(%rax), %xmm1
				; AVX512BW-NEXT: vpsadbw b+1024(%rax), %xmm1, %xmm1
				; AVX512BW-NEXT: vpaddd %xmm0, %xmm1, %xmm1
				; AVX512BW-NEXT: vinserti32x4 $0, %xmm1, %zmm0, %zmm0
				; AVX512BW-NEXT: addq $4, %rax
				; AVX512BW-NEXT: jne .LBB0_1
				; AVX512BW-NEXT: # BB#2: # %middle.block
				; AVX512BW-NEXT: vshufi64x2 {{.*#+}} zmm1 = zmm0[4,5,6,7,0,1,0,1]
				; AVX512BW-NEXT: vpaddd %zmm1, %zmm0, %zmm0
				; AVX512BW-NEXT: vshufi64x2 {{.*#+}} zmm1 = zmm0[2,3,0,1,0,1,0,1]
				; AVX512BW-NEXT: vpaddd %zmm1, %zmm0, %zmm0
				; AVX512BW-NEXT: vpunpckhqdq {{.*#+}} zmm1 = zmm0[1,1,3,3,5,5,7,7]
				; AVX512BW-NEXT: vpaddd %zmm1, %zmm0, %zmm0
				; AVX512BW-NEXT: movl $1, %eax
				; AVX512BW-NEXT: vmovd %eax, %xmm1
				; AVX512BW-NEXT: vpermd %zmm0, %zmm1, %zmm1
				; AVX512BW-NEXT: vpaddd %zmm1, %zmm0, %zmm0
				; AVX512BW-NEXT: vmovd %xmm0, %eax
				; AVX512BW-NEXT: retq
				entry:
				br label %vector.body

				vector.body:
				%index = phi i64 [ 0, %entry ], [ %index.next, %vector.body ]
				%vec.phi = phi <16 x i32> [ zeroinitializer, %entry ], [ %10, %vector.body ]
				%0 = getelementptr inbounds [1024 x i8], [1024 x i8]* @a, i64 0, i64 %index
				%1 = bitcast i8* %0 to <16 x i8>*
				%wide.load = load <16 x i8>, <16 x i8>* %1, align 4
				%2 = zext <16 x i8> %wide.load to <16 x i32>
				%3 = getelementptr inbounds [1024 x i8], [1024 x i8]* @b, i64 0, i64 %index
				%4 = bitcast i8* %3 to <16 x i8>*
				%wide.load1 = load <16 x i8>, <16 x i8>* %4, align 4
				%5 = zext <16 x i8> %wide.load1 to <16 x i32>
				%6 = sub nsw <16 x i32> %2, %5
				%7 = icmp sgt <16 x i32> %6, <i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1>
				%8 = sub nsw <16 x i32> zeroinitializer, %6
				%9 = select <16 x i1> %7, <16 x i32> %6, <16 x i32> %8
				%10 = add nsw <16 x i32> %9, %vec.phi
				%index.next = add i64 %index, 4
				%11 = icmp eq i64 %index.next, 1024
				br i1 %11, label %middle.block, label %vector.body

				middle.block:
				%.lcssa = phi <16 x i32> [ %10, %vector.body ]
				%rdx.shuf = shufflevector <16 x i32> %.lcssa, <16 x i32> undef, <16 x i32> <i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%bin.rdx = add <16 x i32> %.lcssa, %rdx.shuf
				%rdx.shuf2 = shufflevector <16 x i32> %bin.rdx, <16 x i32> undef, <16 x i32> <i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%bin.rdx2 = add <16 x i32> %bin.rdx, %rdx.shuf2
				%rdx.shuf3 = shufflevector <16 x i32> %bin.rdx2, <16 x i32> undef, <16 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%bin.rdx3 = add <16 x i32> %bin.rdx2, %rdx.shuf3
				%rdx.shuf4 = shufflevector <16 x i32> %bin.rdx3, <16 x i32> undef, <16 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%bin.rdx4 = add <16 x i32> %bin.rdx3, %rdx.shuf4
				%12 = extractelement <16 x i32> %bin.rdx4, i32 0
				ret i32 %12
				}

				define i32 @sad_32i8() nounwind {
				; SSE2-LABEL: sad_32i8:
				; SSE2: # BB#0: # %entry
				; SSE2-NEXT: pxor %xmm12, %xmm12
				; SSE2-NEXT: movq $-1024, %rax # imm = 0xFFFFFFFFFFFFFC00
				; SSE2-NEXT: pxor %xmm4, %xmm4
				; SSE2-NEXT: pxor %xmm2, %xmm2
				; SSE2-NEXT: pxor %xmm0, %xmm0
				; SSE2-NEXT: pxor %xmm1, %xmm1
				; SSE2-NEXT: pxor %xmm13, %xmm13
				; SSE2-NEXT: pxor %xmm15, %xmm15
				; SSE2-NEXT: pxor %xmm5, %xmm5
				; SSE2-NEXT: pxor %xmm14, %xmm14
				; SSE2-NEXT: .p2align 4, 0x90
				; SSE2-NEXT: .LBB1_1: # %vector.body
				; SSE2-NEXT: # =>This Inner Loop Header: Depth=1
				; SSE2-NEXT: movdqa %xmm5, -{{[0-9]+}}(%rsp) # 16-byte Spill
				; SSE2-NEXT: movdqa %xmm2, -{{[0-9]+}}(%rsp) # 16-byte Spill
				; SSE2-NEXT: movdqa %xmm0, -{{[0-9]+}}(%rsp) # 16-byte Spill
				; SSE2-NEXT: movdqa %xmm1, -{{[0-9]+}}(%rsp) # 16-byte Spill
				; SSE2-NEXT: movdqa a+1040(%rax), %xmm0
				; SSE2-NEXT: movdqa a+1024(%rax), %xmm1
				; SSE2-NEXT: pshufd {{.*#+}} xmm8 = xmm1[2,3,0,1]
				; SSE2-NEXT: punpcklbw {{.*#+}} xmm8 = xmm8[0],xmm12[0],xmm8[1],xmm12[1],xmm8[2],xmm12[2],xmm8[3],xmm12[3],xmm8[4],xmm12[4],xmm8[5],xmm12[5],xmm8[6],xmm12[6],xmm8[7],xmm12[7]
				; SSE2-NEXT: pshufd {{.*#+}} xmm7 = xmm0[2,3,0,1]
				; SSE2-NEXT: punpcklbw {{.*#+}} xmm7 = xmm7[0],xmm12[0],xmm7[1],xmm12[1],xmm7[2],xmm12[2],xmm7[3],xmm12[3],xmm7[4],xmm12[4],xmm7[5],xmm12[5],xmm7[6],xmm12[6],xmm7[7],xmm12[7]
				; SSE2-NEXT: punpcklbw {{.*#+}} xmm1 = xmm1[0],xmm12[0],xmm1[1],xmm12[1],xmm1[2],xmm12[2],xmm1[3],xmm12[3],xmm1[4],xmm12[4],xmm1[5],xmm12[5],xmm1[6],xmm12[6],xmm1[7],xmm12[7]
				; SSE2-NEXT: movdqa %xmm1, %xmm6
				; SSE2-NEXT: punpckhwd {{.*#+}} xmm1 = xmm1[4],xmm12[4],xmm1[5],xmm12[5],xmm1[6],xmm12[6],xmm1[7],xmm12[7]
				; SSE2-NEXT: punpcklbw {{.*#+}} xmm0 = xmm0[0],xmm12[0],xmm0[1],xmm12[1],xmm0[2],xmm12[2],xmm0[3],xmm12[3],xmm0[4],xmm12[4],xmm0[5],xmm12[5],xmm0[6],xmm12[6],xmm0[7],xmm12[7]
				; SSE2-NEXT: movdqa %xmm0, %xmm2
				; SSE2-NEXT: punpcklwd {{.*#+}} xmm2 = xmm2[0],xmm12[0],xmm2[1],xmm12[1],xmm2[2],xmm12[2],xmm2[3],xmm12[3]
				; SSE2-NEXT: punpckhwd {{.*#+}} xmm0 = xmm0[4],xmm12[4],xmm0[5],xmm12[5],xmm0[6],xmm12[6],xmm0[7],xmm12[7]
				; SSE2-NEXT: movdqa b+1040(%rax), %xmm3
				; SSE2-NEXT: movdqa b+1024(%rax), %xmm5
				; SSE2-NEXT: pshufd {{.*#+}} xmm9 = xmm3[2,3,0,1]
				; SSE2-NEXT: punpcklbw {{.*#+}} xmm3 = xmm3[0],xmm12[0],xmm3[1],xmm12[1],xmm3[2],xmm12[2],xmm3[3],xmm12[3],xmm3[4],xmm12[4],xmm3[5],xmm12[5],xmm3[6],xmm12[6],xmm3[7],xmm12[7]
				; SSE2-NEXT: movdqa %xmm3, %xmm10
				; SSE2-NEXT: punpckhwd {{.*#+}} xmm3 = xmm3[4],xmm12[4],xmm3[5],xmm12[5],xmm3[6],xmm12[6],xmm3[7],xmm12[7]
				; SSE2-NEXT: psubd %xmm3, %xmm0
				; SSE2-NEXT: pshufd {{.*#+}} xmm11 = xmm5[2,3,0,1]
				; SSE2-NEXT: punpcklbw {{.*#+}} xmm5 = xmm5[0],xmm12[0],xmm5[1],xmm12[1],xmm5[2],xmm12[2],xmm5[3],xmm12[3],xmm5[4],xmm12[4],xmm5[5],xmm12[5],xmm5[6],xmm12[6],xmm5[7],xmm12[7]
				; SSE2-NEXT: punpcklwd {{.*#+}} xmm10 = xmm10[0],xmm12[0],xmm10[1],xmm12[1],xmm10[2],xmm12[2],xmm10[3],xmm12[3]
				; SSE2-NEXT: psubd %xmm10, %xmm2
				; SSE2-NEXT: movdqa %xmm5, %xmm3
				; SSE2-NEXT: punpckhwd {{.*#+}} xmm5 = xmm5[4],xmm12[4],xmm5[5],xmm12[5],xmm5[6],xmm12[6],xmm5[7],xmm12[7]
				; SSE2-NEXT: psubd %xmm5, %xmm1
				; SSE2-NEXT: movdqa %xmm7, %xmm5
				; SSE2-NEXT: punpckhwd {{.*#+}} xmm7 = xmm7[4],xmm12[4],xmm7[5],xmm12[5],xmm7[6],xmm12[6],xmm7[7],xmm12[7]
				; SSE2-NEXT: punpcklwd {{.*#+}} xmm6 = xmm6[0],xmm12[0],xmm6[1],xmm12[1],xmm6[2],xmm12[2],xmm6[3],xmm12[3]
				; SSE2-NEXT: punpcklbw {{.*#+}} xmm9 = xmm9[0],xmm12[0],xmm9[1],xmm12[1],xmm9[2],xmm12[2],xmm9[3],xmm12[3],xmm9[4],xmm12[4],xmm9[5],xmm12[5],xmm9[6],xmm12[6],xmm9[7],xmm12[7]
				; SSE2-NEXT: punpcklwd {{.*#+}} xmm3 = xmm3[0],xmm12[0],xmm3[1],xmm12[1],xmm3[2],xmm12[2],xmm3[3],xmm12[3]
				; SSE2-NEXT: psubd %xmm3, %xmm6
				; SSE2-NEXT: movdqa %xmm4, %xmm10
				; SSE2-NEXT: movdqa %xmm9, %xmm4
				; SSE2-NEXT: punpckhwd {{.*#+}} xmm9 = xmm9[4],xmm12[4],xmm9[5],xmm12[5],xmm9[6],xmm12[6],xmm9[7],xmm12[7]
				; SSE2-NEXT: psubd %xmm9, %xmm7
				; SSE2-NEXT: movdqa %xmm8, %xmm3
				; SSE2-NEXT: punpckhwd {{.*#+}} xmm8 = xmm8[4],xmm12[4],xmm8[5],xmm12[5],xmm8[6],xmm12[6],xmm8[7],xmm12[7]
				; SSE2-NEXT: punpcklwd {{.*#+}} xmm5 = xmm5[0],xmm12[0],xmm5[1],xmm12[1],xmm5[2],xmm12[2],xmm5[3],xmm12[3]
				; SSE2-NEXT: punpcklbw {{.*#+}} xmm11 = xmm11[0],xmm12[0],xmm11[1],xmm12[1],xmm11[2],xmm12[2],xmm11[3],xmm12[3],xmm11[4],xmm12[4],xmm11[5],xmm12[5],xmm11[6],xmm12[6],xmm11[7],xmm12[7]
				; SSE2-NEXT: punpcklwd {{.*#+}} xmm4 = xmm4[0],xmm12[0],xmm4[1],xmm12[1],xmm4[2],xmm12[2],xmm4[3],xmm12[3]
				; SSE2-NEXT: psubd %xmm4, %xmm5
				; SSE2-NEXT: movdqa %xmm11, %xmm4
				; SSE2-NEXT: punpckhwd {{.*#+}} xmm11 = xmm11[4],xmm12[4],xmm11[5],xmm12[5],xmm11[6],xmm12[6],xmm11[7],xmm12[7]
				; SSE2-NEXT: psubd %xmm11, %xmm8
				; SSE2-NEXT: punpcklwd {{.*#+}} xmm3 = xmm3[0],xmm12[0],xmm3[1],xmm12[1],xmm3[2],xmm12[2],xmm3[3],xmm12[3]
				; SSE2-NEXT: punpcklwd {{.*#+}} xmm4 = xmm4[0],xmm12[0],xmm4[1],xmm12[1],xmm4[2],xmm12[2],xmm4[3],xmm12[3]
				; SSE2-NEXT: psubd %xmm4, %xmm3
				; SSE2-NEXT: movdqa %xmm3, %xmm4
				; SSE2-NEXT: psrad $31, %xmm4
				; SSE2-NEXT: paddd %xmm4, %xmm3
				; SSE2-NEXT: pxor %xmm4, %xmm3
				; SSE2-NEXT: movdqa %xmm8, %xmm4
				; SSE2-NEXT: psrad $31, %xmm4
				; SSE2-NEXT: paddd %xmm4, %xmm8
				; SSE2-NEXT: pxor %xmm4, %xmm8
				; SSE2-NEXT: movdqa %xmm5, %xmm4
				; SSE2-NEXT: psrad $31, %xmm4
				; SSE2-NEXT: paddd %xmm4, %xmm5
				; SSE2-NEXT: pxor %xmm4, %xmm5
				; SSE2-NEXT: movdqa %xmm7, %xmm4
				; SSE2-NEXT: psrad $31, %xmm4
				; SSE2-NEXT: paddd %xmm4, %xmm7
				; SSE2-NEXT: pxor %xmm4, %xmm7
				; SSE2-NEXT: movdqa %xmm6, %xmm4
				; SSE2-NEXT: psrad $31, %xmm4
				; SSE2-NEXT: paddd %xmm4, %xmm6
				; SSE2-NEXT: pxor %xmm4, %xmm6
				; SSE2-NEXT: movdqa %xmm1, %xmm4
				; SSE2-NEXT: psrad $31, %xmm4
				; SSE2-NEXT: paddd %xmm4, %xmm1
				; SSE2-NEXT: pxor %xmm4, %xmm1
				; SSE2-NEXT: movdqa %xmm2, %xmm4
				; SSE2-NEXT: psrad $31, %xmm4
				; SSE2-NEXT: paddd %xmm4, %xmm2
				; SSE2-NEXT: pxor %xmm4, %xmm2
				; SSE2-NEXT: movdqa %xmm0, %xmm4
				; SSE2-NEXT: psrad $31, %xmm4
				; SSE2-NEXT: paddd %xmm4, %xmm0
				; SSE2-NEXT: pxor %xmm4, %xmm0
				; SSE2-NEXT: movdqa %xmm10, %xmm4
				; SSE2-NEXT: paddd %xmm0, %xmm15
				; SSE2-NEXT: movdqa -{{[0-9]+}}(%rsp), %xmm0 # 16-byte Reload
				; SSE2-NEXT: paddd %xmm2, %xmm13
				; SSE2-NEXT: movdqa -{{[0-9]+}}(%rsp), %xmm2 # 16-byte Reload
				; SSE2-NEXT: paddd %xmm1, %xmm2
				; SSE2-NEXT: movdqa -{{[0-9]+}}(%rsp), %xmm1 # 16-byte Reload
				; SSE2-NEXT: paddd %xmm6, %xmm4
				; SSE2-NEXT: paddd %xmm7, %xmm14
				; SSE2-NEXT: movdqa -{{[0-9]+}}(%rsp), %xmm6 # 16-byte Reload
				; SSE2-NEXT: paddd %xmm5, %xmm6
				; SSE2-NEXT: movdqa %xmm6, -{{[0-9]+}}(%rsp) # 16-byte Spill
				; SSE2-NEXT: movdqa -{{[0-9]+}}(%rsp), %xmm5 # 16-byte Reload
				; SSE2-NEXT: paddd %xmm8, %xmm1
				; SSE2-NEXT: paddd %xmm3, %xmm0
				; SSE2-NEXT: addq $4, %rax
				; SSE2-NEXT: jne .LBB1_1
				; SSE2-NEXT: # BB#2: # %middle.block
				; SSE2-NEXT: paddd %xmm15, %xmm2
				; SSE2-NEXT: paddd %xmm14, %xmm1
				; SSE2-NEXT: paddd %xmm13, %xmm4
				; SSE2-NEXT: paddd %xmm5, %xmm0
				; SSE2-NEXT: paddd %xmm4, %xmm0
				; SSE2-NEXT: paddd %xmm2, %xmm1
				; SSE2-NEXT: paddd %xmm0, %xmm1
				; SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm1[2,3,0,1]
				; SSE2-NEXT: paddd %xmm1, %xmm0
				; SSE2-NEXT: pshufd {{.*#+}} xmm1 = xmm0[1,1,2,3]
				; SSE2-NEXT: paddd %xmm0, %xmm1
				; SSE2-NEXT: movd %xmm1, %eax
				; SSE2-NEXT: retq
				;
				; AVX2-LABEL: sad_32i8:
				; AVX2: # BB#0: # %entry
				; AVX2-NEXT: pushq %rbp
				; AVX2-NEXT: movq %rsp, %rbp
				; AVX2-NEXT: andq $-128, %rsp
				; AVX2-NEXT: subq $256, %rsp # imm = 0x100
				; AVX2-NEXT: vpxor %ymm0, %ymm0, %ymm0
				; AVX2-NEXT: movq $-1024, %rax # imm = 0xFFFFFFFFFFFFFC00
				; AVX2-NEXT: vpxor %ymm1, %ymm1, %ymm1
				; AVX2-NEXT: vpxor %ymm2, %ymm2, %ymm2
				; AVX2-NEXT: vpxor %ymm3, %ymm3, %ymm3
				; AVX2-NEXT: .p2align 4, 0x90
				; AVX2-NEXT: .LBB1_1: # %vector.body
				; AVX2-NEXT: # =>This Inner Loop Header: Depth=1
				; AVX2-NEXT: vmovdqa a+1024(%rax), %ymm4
				; AVX2-NEXT: vmovdqa %ymm0, (%rsp)
				; AVX2-NEXT: vmovdqa %ymm1, {{[0-9]+}}(%rsp)
				; AVX2-NEXT: vmovdqa %ymm2, {{[0-9]+}}(%rsp)
				; AVX2-NEXT: vmovdqa %ymm3, {{[0-9]+}}(%rsp)
				; AVX2-NEXT: vpsadbw b+1024(%rax), %ymm4, %ymm1
				; AVX2-NEXT: vpaddd %ymm0, %ymm1, %ymm0
				; AVX2-NEXT: vmovdqa %ymm0, (%rsp)
				; AVX2-NEXT: vmovdqa {{[0-9]+}}(%rsp), %ymm1
				; AVX2-NEXT: vmovdqa {{[0-9]+}}(%rsp), %ymm2
				; AVX2-NEXT: vmovdqa {{[0-9]+}}(%rsp), %ymm3
				; AVX2-NEXT: addq $4, %rax
				; AVX2-NEXT: jne .LBB1_1
				; AVX2-NEXT: # BB#2: # %middle.block
				; AVX2-NEXT: vpaddd %ymm2, %ymm0, %ymm0
				; AVX2-NEXT: vpaddd %ymm3, %ymm1, %ymm1
				; AVX2-NEXT: vpaddd %ymm1, %ymm0, %ymm0
				; AVX2-NEXT: vextracti128 $1, %ymm0, %xmm1
				; AVX2-NEXT: vpaddd %ymm1, %ymm0, %ymm0
				; AVX2-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]
				; AVX2-NEXT: vpaddd %ymm1, %ymm0, %ymm0
				; AVX2-NEXT: vphaddd %ymm0, %ymm0, %ymm0
				; AVX2-NEXT: vmovd %xmm0, %eax
				; AVX2-NEXT: movq %rbp, %rsp
				; AVX2-NEXT: popq %rbp
				; AVX2-NEXT: vzeroupper
				; AVX2-NEXT: retq
				;
				; AVX512F-LABEL: sad_32i8:
				; AVX512F: # BB#0: # %entry
				; AVX512F-NEXT: pushq %rbp
				; AVX512F-NEXT: movq %rsp, %rbp
				; AVX512F-NEXT: andq $-128, %rsp
				; AVX512F-NEXT: subq $256, %rsp # imm = 0x100
				; AVX512F-NEXT: vpxord %zmm0, %zmm0, %zmm0
				; AVX512F-NEXT: movq $-1024, %rax # imm = 0xFFFFFFFFFFFFFC00
				; AVX512F-NEXT: vpxord %zmm1, %zmm1, %zmm1
				; AVX512F-NEXT: .p2align 4, 0x90
				; AVX512F-NEXT: .LBB1_1: # %vector.body
				; AVX512F-NEXT: # =>This Inner Loop Header: Depth=1
				; AVX512F-NEXT: vmovdqa a+1024(%rax), %ymm2
				; AVX512F-NEXT: vmovdqa32 %zmm0, (%rsp)
				; AVX512F-NEXT: vmovdqa32 %zmm1, {{[0-9]+}}(%rsp)
				; AVX512F-NEXT: vpsadbw b+1024(%rax), %ymm2, %ymm1
				; AVX512F-NEXT: vpaddd %ymm0, %ymm1, %ymm0
				; AVX512F-NEXT: vmovdqa %ymm0, (%rsp)
				; AVX512F-NEXT: vmovdqa32 {{[0-9]+}}(%rsp), %zmm1
				; AVX512F-NEXT: vmovdqa32 (%rsp), %zmm0
				; AVX512F-NEXT: addq $4, %rax
				; AVX512F-NEXT: jne .LBB1_1
				; AVX512F-NEXT: # BB#2: # %middle.block
				; AVX512F-NEXT: vpaddd %zmm1, %zmm0, %zmm0
				; AVX512F-NEXT: vshufi64x2 {{.*#+}} zmm1 = zmm0[4,5,6,7,0,1,0,1]
				; AVX512F-NEXT: vpaddd %zmm1, %zmm0, %zmm0
				; AVX512F-NEXT: vshufi64x2 {{.*#+}} zmm1 = zmm0[2,3,0,1,0,1,0,1]
				; AVX512F-NEXT: vpaddd %zmm1, %zmm0, %zmm0
				; AVX512F-NEXT: vpunpckhqdq {{.*#+}} zmm1 = zmm0[1,1,3,3,5,5,7,7]
				; AVX512F-NEXT: vpaddd %zmm1, %zmm0, %zmm0
				; AVX512F-NEXT: movl $1, %eax
				; AVX512F-NEXT: vmovd %eax, %xmm1
				; AVX512F-NEXT: vpermd %zmm0, %zmm1, %zmm1
				; AVX512F-NEXT: vpaddd %zmm1, %zmm0, %zmm0
				; AVX512F-NEXT: vmovd %xmm0, %eax
				; AVX512F-NEXT: movq %rbp, %rsp
				; AVX512F-NEXT: popq %rbp
				; AVX512F-NEXT: retq
				;
				; AVX512BW-LABEL: sad_32i8:
				; AVX512BW: # BB#0: # %entry
				; AVX512BW-NEXT: pushq %rbp
				; AVX512BW-NEXT: movq %rsp, %rbp
				; AVX512BW-NEXT: andq $-128, %rsp
				; AVX512BW-NEXT: subq $256, %rsp # imm = 0x100
				; AVX512BW-NEXT: vpxord %zmm0, %zmm0, %zmm0
				; AVX512BW-NEXT: movq $-1024, %rax # imm = 0xFFFFFFFFFFFFFC00
				; AVX512BW-NEXT: vpxord %zmm1, %zmm1, %zmm1
				; AVX512BW-NEXT: .p2align 4, 0x90
				; AVX512BW-NEXT: .LBB1_1: # %vector.body
				; AVX512BW-NEXT: # =>This Inner Loop Header: Depth=1
				; AVX512BW-NEXT: vmovdqa a+1024(%rax), %ymm2
				; AVX512BW-NEXT: vmovdqa32 %zmm0, (%rsp)
				; AVX512BW-NEXT: vmovdqa32 %zmm1, {{[0-9]+}}(%rsp)
				; AVX512BW-NEXT: vpsadbw b+1024(%rax), %ymm2, %ymm1
				; AVX512BW-NEXT: vpaddd %ymm0, %ymm1, %ymm0
				; AVX512BW-NEXT: vmovdqa %ymm0, (%rsp)
				; AVX512BW-NEXT: vmovdqa32 {{[0-9]+}}(%rsp), %zmm1
				; AVX512BW-NEXT: vmovdqa32 (%rsp), %zmm0
				; AVX512BW-NEXT: addq $4, %rax
				; AVX512BW-NEXT: jne .LBB1_1
				; AVX512BW-NEXT: # BB#2: # %middle.block
				; AVX512BW-NEXT: vpaddd %zmm1, %zmm0, %zmm0
				; AVX512BW-NEXT: vshufi64x2 {{.*#+}} zmm1 = zmm0[4,5,6,7,0,1,0,1]
				; AVX512BW-NEXT: vpaddd %zmm1, %zmm0, %zmm0
				; AVX512BW-NEXT: vshufi64x2 {{.*#+}} zmm1 = zmm0[2,3,0,1,0,1,0,1]
				; AVX512BW-NEXT: vpaddd %zmm1, %zmm0, %zmm0
				; AVX512BW-NEXT: vpunpckhqdq {{.*#+}} zmm1 = zmm0[1,1,3,3,5,5,7,7]
				; AVX512BW-NEXT: vpaddd %zmm1, %zmm0, %zmm0
				; AVX512BW-NEXT: movl $1, %eax
				; AVX512BW-NEXT: vmovd %eax, %xmm1
				; AVX512BW-NEXT: vpermd %zmm0, %zmm1, %zmm1
				; AVX512BW-NEXT: vpaddd %zmm1, %zmm0, %zmm0
				; AVX512BW-NEXT: vmovd %xmm0, %eax
				; AVX512BW-NEXT: movq %rbp, %rsp
				; AVX512BW-NEXT: popq %rbp
				; AVX512BW-NEXT: retq
				entry:
				br label %vector.body

				vector.body:
				%index = phi i64 [ 0, %entry ], [ %index.next, %vector.body ]
				%vec.phi = phi <32 x i32> [ zeroinitializer, %entry ], [ %10, %vector.body ]
				%0 = getelementptr inbounds [1024 x i8], [1024 x i8]* @a, i64 0, i64 %index
				%1 = bitcast i8* %0 to <32 x i8>*
				%wide.load = load <32 x i8>, <32 x i8>* %1, align 32
				%2 = zext <32 x i8> %wide.load to <32 x i32>
				%3 = getelementptr inbounds [1024 x i8], [1024 x i8]* @b, i64 0, i64 %index
				%4 = bitcast i8* %3 to <32 x i8>*
				%wide.load1 = load <32 x i8>, <32 x i8>* %4, align 32
				%5 = zext <32 x i8> %wide.load1 to <32 x i32>
				%6 = sub nsw <32 x i32> %2, %5
				%7 = icmp sgt <32 x i32> %6, <i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1>
				%8 = sub nsw <32 x i32> zeroinitializer, %6
				%9 = select <32 x i1> %7, <32 x i32> %6, <32 x i32> %8
				%10 = add nsw <32 x i32> %9, %vec.phi
				%index.next = add i64 %index, 4
				%11 = icmp eq i64 %index.next, 1024
				br i1 %11, label %middle.block, label %vector.body

				middle.block:
				%.lcssa = phi <32 x i32> [ %10, %vector.body ]
				%rdx.shuf = shufflevector <32 x i32> %.lcssa, <32 x i32> undef, <32 x i32> <i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%bin.rdx = add <32 x i32> %.lcssa, %rdx.shuf
				%rdx.shuf2 = shufflevector <32 x i32> %bin.rdx, <32 x i32> undef, <32 x i32> <i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%bin.rdx2 = add <32 x i32> %bin.rdx, %rdx.shuf2
				%rdx.shuf3 = shufflevector <32 x i32> %bin.rdx2, <32 x i32> undef, <32 x i32> <i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%bin.rdx3 = add <32 x i32> %bin.rdx2, %rdx.shuf3
				%rdx.shuf4 = shufflevector <32 x i32> %bin.rdx3, <32 x i32> undef, <32 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%bin.rdx4 = add <32 x i32> %bin.rdx3, %rdx.shuf4
				%rdx.shuf5 = shufflevector <32 x i32> %bin.rdx4, <32 x i32> undef, <32 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%bin.rdx5 = add <32 x i32> %bin.rdx4, %rdx.shuf5
				%12 = extractelement <32 x i32> %bin.rdx5, i32 0
				ret i32 %12
				}

				define i32 @sad_avx64i8() nounwind {
				; SSE2-LABEL: sad_avx64i8:
				; SSE2: # BB#0: # %entry
				; SSE2-NEXT: subq $232, %rsp
				; SSE2-NEXT: pxor %xmm8, %xmm8
				; SSE2-NEXT: movq $-1024, %rax # imm = 0xFFFFFFFFFFFFFC00
				; SSE2-NEXT: pxor %xmm5, %xmm5
				; SSE2-NEXT: pxor %xmm2, %xmm2
				; SSE2-NEXT: pxor %xmm1, %xmm1
				; SSE2-NEXT: pxor %xmm3, %xmm3
				; SSE2-NEXT: pxor %xmm6, %xmm6
				; SSE2-NEXT: pxor %xmm0, %xmm0
				; SSE2-NEXT: movdqa %xmm0, -{{[0-9]+}}(%rsp) # 16-byte Spill
				; SSE2-NEXT: pxor %xmm13, %xmm13
				; SSE2-NEXT: pxor %xmm10, %xmm10
				; SSE2-NEXT: pxor %xmm0, %xmm0
				; SSE2-NEXT: movdqa %xmm0, -{{[0-9]+}}(%rsp) # 16-byte Spill
				; SSE2-NEXT: pxor %xmm12, %xmm12
				; SSE2-NEXT: pxor %xmm0, %xmm0
				; SSE2-NEXT: movdqa %xmm0, -{{[0-9]+}}(%rsp) # 16-byte Spill
				; SSE2-NEXT: pxor %xmm11, %xmm11
				; SSE2-NEXT: pxor %xmm15, %xmm15
				; SSE2-NEXT: pxor %xmm9, %xmm9
				; SSE2-NEXT: pxor %xmm7, %xmm7
				; SSE2-NEXT: pxor %xmm0, %xmm0
				; SSE2-NEXT: .p2align 4, 0x90
				; SSE2-NEXT: .LBB2_1: # %vector.body
				; SSE2-NEXT: # =>This Inner Loop Header: Depth=1
				; SSE2-NEXT: movdqa %xmm12, {{[0-9]+}}(%rsp) # 16-byte Spill
				; SSE2-NEXT: movdqa %xmm7, -{{[0-9]+}}(%rsp) # 16-byte Spill
				; SSE2-NEXT: movdqa %xmm15, {{[0-9]+}}(%rsp) # 16-byte Spill
				; SSE2-NEXT: movdqa %xmm0, (%rsp) # 16-byte Spill
				; SSE2-NEXT: movdqa %xmm11, {{[0-9]+}}(%rsp) # 16-byte Spill
				; SSE2-NEXT: movdqa %xmm9, {{[0-9]+}}(%rsp) # 16-byte Spill
				; SSE2-NEXT: movdqa %xmm3, {{[0-9]+}}(%rsp) # 16-byte Spill
				; SSE2-NEXT: movdqa %xmm2, {{[0-9]+}}(%rsp) # 16-byte Spill
				; SSE2-NEXT: movdqa %xmm6, {{[0-9]+}}(%rsp) # 16-byte Spill
				; SSE2-NEXT: movdqa %xmm1, {{[0-9]+}}(%rsp) # 16-byte Spill
				; SSE2-NEXT: movdqa %xmm5, {{[0-9]+}}(%rsp) # 16-byte Spill
				; SSE2-NEXT: movdqa %xmm13, {{[0-9]+}}(%rsp) # 16-byte Spill
				; SSE2-NEXT: movdqa %xmm10, {{[0-9]+}}(%rsp) # 16-byte Spill
				; SSE2-NEXT: movdqa a+1040(%rax), %xmm13
				; SSE2-NEXT: movdqa a+1024(%rax), %xmm1
				; SSE2-NEXT: movdqa a+1056(%rax), %xmm3
				; SSE2-NEXT: movdqa a+1072(%rax), %xmm6
				; SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm3[2,3,0,1]
				; SSE2-NEXT: movdqa %xmm0, -{{[0-9]+}}(%rsp) # 16-byte Spill
				; SSE2-NEXT: punpcklbw {{.*#+}} xmm3 = xmm3[0],xmm8[0],xmm3[1],xmm8[1],xmm3[2],xmm8[2],xmm3[3],xmm8[3],xmm3[4],xmm8[4],xmm3[5],xmm8[5],xmm3[6],xmm8[6],xmm3[7],xmm8[7]
				; SSE2-NEXT: movdqa %xmm3, %xmm12
				; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm1[2,3,0,1]
				; SSE2-NEXT: punpcklbw {{.*#+}} xmm2 = xmm2[0],xmm8[0],xmm2[1],xmm8[1],xmm2[2],xmm8[2],xmm2[3],xmm8[3],xmm2[4],xmm8[4],xmm2[5],xmm8[5],xmm2[6],xmm8[6],xmm2[7],xmm8[7]
				; SSE2-NEXT: pshufd {{.*#+}} xmm10 = xmm13[2,3,0,1]
				; SSE2-NEXT: punpcklbw {{.*#+}} xmm10 = xmm10[0],xmm8[0],xmm10[1],xmm8[1],xmm10[2],xmm8[2],xmm10[3],xmm8[3],xmm10[4],xmm8[4],xmm10[5],xmm8[5],xmm10[6],xmm8[6],xmm10[7],xmm8[7]
				; SSE2-NEXT: movdqa %xmm10, -{{[0-9]+}}(%rsp) # 16-byte Spill
				; SSE2-NEXT: punpckhwd {{.*#+}} xmm10 = xmm10[4],xmm8[4],xmm10[5],xmm8[5],xmm10[6],xmm8[6],xmm10[7],xmm8[7]
				; SSE2-NEXT: punpcklwd {{.*#+}} xmm3 = xmm3[0],xmm8[0],xmm3[1],xmm8[1],xmm3[2],xmm8[2],xmm3[3],xmm8[3]
				; SSE2-NEXT: punpcklbw {{.*#+}} xmm1 = xmm1[0],xmm8[0],xmm1[1],xmm8[1],xmm1[2],xmm8[2],xmm1[3],xmm8[3],xmm1[4],xmm8[4],xmm1[5],xmm8[5],xmm1[6],xmm8[6],xmm1[7],xmm8[7]
				; SSE2-NEXT: movdqa %xmm1, %xmm0
				; SSE2-NEXT: punpcklwd {{.*#+}} xmm0 = xmm0[0],xmm8[0],xmm0[1],xmm8[1],xmm0[2],xmm8[2],xmm0[3],xmm8[3]
				; SSE2-NEXT: movdqa %xmm0, %xmm15
				; SSE2-NEXT: punpckhwd {{.*#+}} xmm1 = xmm1[4],xmm8[4],xmm1[5],xmm8[5],xmm1[6],xmm8[6],xmm1[7],xmm8[7]
				; SSE2-NEXT: punpcklbw {{.*#+}} xmm13 = xmm13[0],xmm8[0],xmm13[1],xmm8[1],xmm13[2],xmm8[2],xmm13[3],xmm8[3],xmm13[4],xmm8[4],xmm13[5],xmm8[5],xmm13[6],xmm8[6],xmm13[7],xmm8[7]
				; SSE2-NEXT: movdqa %xmm13, %xmm0
				; SSE2-NEXT: punpcklwd {{.*#+}} xmm0 = xmm0[0],xmm8[0],xmm0[1],xmm8[1],xmm0[2],xmm8[2],xmm0[3],xmm8[3]
				; SSE2-NEXT: punpckhwd {{.*#+}} xmm13 = xmm13[4],xmm8[4],xmm13[5],xmm8[5],xmm13[6],xmm8[6],xmm13[7],xmm8[7]
				; SSE2-NEXT: movdqa b+1040(%rax), %xmm7
				; SSE2-NEXT: movdqa b+1024(%rax), %xmm11
				; SSE2-NEXT: movdqa b+1056(%rax), %xmm9
				; SSE2-NEXT: pshufd {{.*#+}} xmm5 = xmm7[2,3,0,1]
				; SSE2-NEXT: punpcklbw {{.*#+}} xmm7 = xmm7[0],xmm8[0],xmm7[1],xmm8[1],xmm7[2],xmm8[2],xmm7[3],xmm8[3],xmm7[4],xmm8[4],xmm7[5],xmm8[5],xmm7[6],xmm8[6],xmm7[7],xmm8[7]
				; SSE2-NEXT: movdqa %xmm7, %xmm4
				; SSE2-NEXT: punpckhwd {{.*#+}} xmm7 = xmm7[4],xmm8[4],xmm7[5],xmm8[5],xmm7[6],xmm8[6],xmm7[7],xmm8[7]
				; SSE2-NEXT: psubd %xmm7, %xmm13
				; SSE2-NEXT: pshufd {{.*#+}} xmm7 = xmm11[2,3,0,1]
				; SSE2-NEXT: punpcklbw {{.*#+}} xmm11 = xmm11[0],xmm8[0],xmm11[1],xmm8[1],xmm11[2],xmm8[2],xmm11[3],xmm8[3],xmm11[4],xmm8[4],xmm11[5],xmm8[5],xmm11[6],xmm8[6],xmm11[7],xmm8[7]
				; SSE2-NEXT: punpcklwd {{.*#+}} xmm4 = xmm4[0],xmm8[0],xmm4[1],xmm8[1],xmm4[2],xmm8[2],xmm4[3],xmm8[3]
				; SSE2-NEXT: psubd %xmm4, %xmm0
				; SSE2-NEXT: movdqa %xmm0, {{[0-9]+}}(%rsp) # 16-byte Spill
				; SSE2-NEXT: movdqa %xmm11, %xmm4
				; SSE2-NEXT: punpckhwd {{.*#+}} xmm11 = xmm11[4],xmm8[4],xmm11[5],xmm8[5],xmm11[6],xmm8[6],xmm11[7],xmm8[7]
				; SSE2-NEXT: psubd %xmm11, %xmm1
				; SSE2-NEXT: pshufd {{.*#+}} xmm14 = xmm9[2,3,0,1]
				; SSE2-NEXT: punpcklbw {{.*#+}} xmm9 = xmm9[0],xmm8[0],xmm9[1],xmm8[1],xmm9[2],xmm8[2],xmm9[3],xmm8[3],xmm9[4],xmm8[4],xmm9[5],xmm8[5],xmm9[6],xmm8[6],xmm9[7],xmm8[7]
				; SSE2-NEXT: punpcklwd {{.*#+}} xmm4 = xmm4[0],xmm8[0],xmm4[1],xmm8[1],xmm4[2],xmm8[2],xmm4[3],xmm8[3]
				; SSE2-NEXT: psubd %xmm4, %xmm15
				; SSE2-NEXT: movdqa %xmm15, -{{[0-9]+}}(%rsp) # 16-byte Spill
				; SSE2-NEXT: movdqa %xmm9, %xmm4
				; SSE2-NEXT: punpcklbw {{.*#+}} xmm5 = xmm5[0],xmm8[0],xmm5[1],xmm8[1],xmm5[2],xmm8[2],xmm5[3],xmm8[3],xmm5[4],xmm8[4],xmm5[5],xmm8[5],xmm5[6],xmm8[6],xmm5[7],xmm8[7]
				; SSE2-NEXT: punpcklwd {{.*#+}} xmm9 = xmm9[0],xmm8[0],xmm9[1],xmm8[1],xmm9[2],xmm8[2],xmm9[3],xmm8[3]
				; SSE2-NEXT: psubd %xmm9, %xmm3
				; SSE2-NEXT: movdqa %xmm5, %xmm0
				; SSE2-NEXT: punpckhwd {{.*#+}} xmm5 = xmm5[4],xmm8[4],xmm5[5],xmm8[5],xmm5[6],xmm8[6],xmm5[7],xmm8[7]
				; SSE2-NEXT: psubd %xmm5, %xmm10
				; SSE2-NEXT: movdqa %xmm2, %xmm15
				; SSE2-NEXT: punpckhwd {{.*#+}} xmm2 = xmm2[4],xmm8[4],xmm2[5],xmm8[5],xmm2[6],xmm8[6],xmm2[7],xmm8[7]
				; SSE2-NEXT: movdqa -{{[0-9]+}}(%rsp), %xmm5 # 16-byte Reload
				; SSE2-NEXT: punpcklwd {{.*#+}} xmm5 = xmm5[0],xmm8[0],xmm5[1],xmm8[1],xmm5[2],xmm8[2],xmm5[3],xmm8[3]
				; SSE2-NEXT: punpcklbw {{.*#+}} xmm7 = xmm7[0],xmm8[0],xmm7[1],xmm8[1],xmm7[2],xmm8[2],xmm7[3],xmm8[3],xmm7[4],xmm8[4],xmm7[5],xmm8[5],xmm7[6],xmm8[6],xmm7[7],xmm8[7]
				; SSE2-NEXT: punpcklwd {{.*#+}} xmm0 = xmm0[0],xmm8[0],xmm0[1],xmm8[1],xmm0[2],xmm8[2],xmm0[3],xmm8[3]
				; SSE2-NEXT: psubd %xmm0, %xmm5
				; SSE2-NEXT: movdqa %xmm5, -{{[0-9]+}}(%rsp) # 16-byte Spill
				; SSE2-NEXT: movdqa %xmm7, %xmm0
				; SSE2-NEXT: punpckhwd {{.*#+}} xmm7 = xmm7[4],xmm8[4],xmm7[5],xmm8[5],xmm7[6],xmm8[6],xmm7[7],xmm8[7]
				; SSE2-NEXT: psubd %xmm7, %xmm2
				; SSE2-NEXT: movdqa %xmm2, -{{[0-9]+}}(%rsp) # 16-byte Spill
				; SSE2-NEXT: pshufd {{.*#+}} xmm7 = xmm6[2,3,0,1]
				; SSE2-NEXT: punpcklbw {{.*#+}} xmm6 = xmm6[0],xmm8[0],xmm6[1],xmm8[1],xmm6[2],xmm8[2],xmm6[3],xmm8[3],xmm6[4],xmm8[4],xmm6[5],xmm8[5],xmm6[6],xmm8[6],xmm6[7],xmm8[7]
				; SSE2-NEXT: movdqa -{{[0-9]+}}(%rsp), %xmm2 # 16-byte Reload
				; SSE2-NEXT: punpcklbw {{.*#+}} xmm2 = xmm2[0],xmm8[0],xmm2[1],xmm8[1],xmm2[2],xmm8[2],xmm2[3],xmm8[3],xmm2[4],xmm8[4],xmm2[5],xmm8[5],xmm2[6],xmm8[6],xmm2[7],xmm8[7]
				; SSE2-NEXT: punpcklwd {{.*#+}} xmm15 = xmm15[0],xmm8[0],xmm15[1],xmm8[1],xmm15[2],xmm8[2],xmm15[3],xmm8[3]
				; SSE2-NEXT: punpcklwd {{.*#+}} xmm0 = xmm0[0],xmm8[0],xmm0[1],xmm8[1],xmm0[2],xmm8[2],xmm0[3],xmm8[3]
				; SSE2-NEXT: psubd %xmm0, %xmm15
				; SSE2-NEXT: movdqa %xmm2, %xmm11
				; SSE2-NEXT: punpcklwd {{.*#+}} xmm2 = xmm2[0],xmm8[0],xmm2[1],xmm8[1],xmm2[2],xmm8[2],xmm2[3],xmm8[3]
				; SSE2-NEXT: punpckhwd {{.*#+}} xmm12 = xmm12[4],xmm8[4],xmm12[5],xmm8[5],xmm12[6],xmm8[6],xmm12[7],xmm8[7]
				; SSE2-NEXT: punpcklbw {{.*#+}} xmm14 = xmm14[0],xmm8[0],xmm14[1],xmm8[1],xmm14[2],xmm8[2],xmm14[3],xmm8[3],xmm14[4],xmm8[4],xmm14[5],xmm8[5],xmm14[6],xmm8[6],xmm14[7],xmm8[7]
				; SSE2-NEXT: punpckhwd {{.*#+}} xmm4 = xmm4[4],xmm8[4],xmm4[5],xmm8[5],xmm4[6],xmm8[6],xmm4[7],xmm8[7]
				; SSE2-NEXT: psubd %xmm4, %xmm12
				; SSE2-NEXT: movdqa %xmm14, %xmm0
				; SSE2-NEXT: punpcklwd {{.*#+}} xmm14 = xmm14[0],xmm8[0],xmm14[1],xmm8[1],xmm14[2],xmm8[2],xmm14[3],xmm8[3]
				; SSE2-NEXT: psubd %xmm14, %xmm2
				; SSE2-NEXT: movdqa %xmm2, %xmm14
				; SSE2-NEXT: movdqa %xmm6, %xmm9
				; SSE2-NEXT: punpcklwd {{.*#+}} xmm6 = xmm6[0],xmm8[0],xmm6[1],xmm8[1],xmm6[2],xmm8[2],xmm6[3],xmm8[3]
				; SSE2-NEXT: punpckhwd {{.*#+}} xmm11 = xmm11[4],xmm8[4],xmm11[5],xmm8[5],xmm11[6],xmm8[6],xmm11[7],xmm8[7]
				; SSE2-NEXT: punpckhwd {{.*#+}} xmm0 = xmm0[4],xmm8[4],xmm0[5],xmm8[5],xmm0[6],xmm8[6],xmm0[7],xmm8[7]
				; SSE2-NEXT: psubd %xmm0, %xmm11
				; SSE2-NEXT: movdqa b+1072(%rax), %xmm0
				; SSE2-NEXT: pshufd {{.*#+}} xmm4 = xmm0[2,3,0,1]
				; SSE2-NEXT: punpcklbw {{.*#+}} xmm0 = xmm0[0],xmm8[0],xmm0[1],xmm8[1],xmm0[2],xmm8[2],xmm0[3],xmm8[3],xmm0[4],xmm8[4],xmm0[5],xmm8[5],xmm0[6],xmm8[6],xmm0[7],xmm8[7]
				; SSE2-NEXT: movdqa %xmm0, %xmm5
				; SSE2-NEXT: punpcklwd {{.*#+}} xmm0 = xmm0[0],xmm8[0],xmm0[1],xmm8[1],xmm0[2],xmm8[2],xmm0[3],xmm8[3]
				; SSE2-NEXT: psubd %xmm0, %xmm6
				; SSE2-NEXT: punpckhwd {{.*#+}} xmm9 = xmm9[4],xmm8[4],xmm9[5],xmm8[5],xmm9[6],xmm8[6],xmm9[7],xmm8[7]
				; SSE2-NEXT: punpckhwd {{.*#+}} xmm5 = xmm5[4],xmm8[4],xmm5[5],xmm8[5],xmm5[6],xmm8[6],xmm5[7],xmm8[7]
				; SSE2-NEXT: psubd %xmm5, %xmm9
				; SSE2-NEXT: punpcklbw {{.*#+}} xmm7 = xmm7[0],xmm8[0],xmm7[1],xmm8[1],xmm7[2],xmm8[2],xmm7[3],xmm8[3],xmm7[4],xmm8[4],xmm7[5],xmm8[5],xmm7[6],xmm8[6],xmm7[7],xmm8[7]
				; SSE2-NEXT: movdqa %xmm7, %xmm0
				; SSE2-NEXT: punpcklwd {{.*#+}} xmm7 = xmm7[0],xmm8[0],xmm7[1],xmm8[1],xmm7[2],xmm8[2],xmm7[3],xmm8[3]
				; SSE2-NEXT: punpcklbw {{.*#+}} xmm4 = xmm4[0],xmm8[0],xmm4[1],xmm8[1],xmm4[2],xmm8[2],xmm4[3],xmm8[3],xmm4[4],xmm8[4],xmm4[5],xmm8[5],xmm4[6],xmm8[6],xmm4[7],xmm8[7]
				; SSE2-NEXT: movdqa %xmm4, %xmm5
				; SSE2-NEXT: punpcklwd {{.*#+}} xmm4 = xmm4[0],xmm8[0],xmm4[1],xmm8[1],xmm4[2],xmm8[2],xmm4[3],xmm8[3]
				; SSE2-NEXT: psubd %xmm4, %xmm7
				; SSE2-NEXT: punpckhwd {{.*#+}} xmm0 = xmm0[4],xmm8[4],xmm0[5],xmm8[5],xmm0[6],xmm8[6],xmm0[7],xmm8[7]
				; SSE2-NEXT: punpckhwd {{.*#+}} xmm5 = xmm5[4],xmm8[4],xmm5[5],xmm8[5],xmm5[6],xmm8[6],xmm5[7],xmm8[7]
				; SSE2-NEXT: psubd %xmm5, %xmm0
				; SSE2-NEXT: movdqa %xmm0, %xmm4
				; SSE2-NEXT: psrad $31, %xmm4
				; SSE2-NEXT: paddd %xmm4, %xmm0
				; SSE2-NEXT: pxor %xmm4, %xmm0
				; SSE2-NEXT: movdqa %xmm7, %xmm4
				; SSE2-NEXT: psrad $31, %xmm4
				; SSE2-NEXT: paddd %xmm4, %xmm7
				; SSE2-NEXT: pxor %xmm4, %xmm7
				; SSE2-NEXT: movdqa %xmm9, %xmm4
				; SSE2-NEXT: psrad $31, %xmm4
				; SSE2-NEXT: paddd %xmm4, %xmm9
				; SSE2-NEXT: pxor %xmm4, %xmm9
				; SSE2-NEXT: movdqa %xmm6, %xmm4
				; SSE2-NEXT: psrad $31, %xmm4
				; SSE2-NEXT: paddd %xmm4, %xmm6
				; SSE2-NEXT: pxor %xmm4, %xmm6
				; SSE2-NEXT: movdqa %xmm6, -{{[0-9]+}}(%rsp) # 16-byte Spill
				; SSE2-NEXT: movdqa %xmm11, %xmm4
				; SSE2-NEXT: psrad $31, %xmm4
				; SSE2-NEXT: paddd %xmm4, %xmm11
				; SSE2-NEXT: pxor %xmm4, %xmm11
				; SSE2-NEXT: movdqa %xmm14, %xmm4
				; SSE2-NEXT: psrad $31, %xmm4
				; SSE2-NEXT: paddd %xmm4, %xmm14
				; SSE2-NEXT: pxor %xmm4, %xmm14
				; SSE2-NEXT: movdqa %xmm12, %xmm4
				; SSE2-NEXT: psrad $31, %xmm4
				; SSE2-NEXT: paddd %xmm4, %xmm12
				; SSE2-NEXT: pxor %xmm4, %xmm12
				; SSE2-NEXT: movdqa %xmm12, {{[0-9]+}}(%rsp) # 16-byte Spill
				; SSE2-NEXT: movdqa %xmm15, %xmm4
				; SSE2-NEXT: psrad $31, %xmm4
				; SSE2-NEXT: paddd %xmm4, %xmm15
				; SSE2-NEXT: pxor %xmm4, %xmm15
				; SSE2-NEXT: movdqa -{{[0-9]+}}(%rsp), %xmm2 # 16-byte Reload
				; SSE2-NEXT: movdqa %xmm2, %xmm4
				; SSE2-NEXT: psrad $31, %xmm4
				; SSE2-NEXT: paddd %xmm4, %xmm2
				; SSE2-NEXT: pxor %xmm4, %xmm2
				; SSE2-NEXT: movdqa %xmm2, -{{[0-9]+}}(%rsp) # 16-byte Spill
				; SSE2-NEXT: movdqa -{{[0-9]+}}(%rsp), %xmm2 # 16-byte Reload
				; SSE2-NEXT: movdqa %xmm2, %xmm4
				; SSE2-NEXT: psrad $31, %xmm4
				; SSE2-NEXT: paddd %xmm4, %xmm2
				; SSE2-NEXT: pxor %xmm4, %xmm2
				; SSE2-NEXT: movdqa %xmm2, -{{[0-9]+}}(%rsp) # 16-byte Spill
				; SSE2-NEXT: movdqa %xmm10, %xmm4
				; SSE2-NEXT: psrad $31, %xmm4
				; SSE2-NEXT: paddd %xmm4, %xmm10
				; SSE2-NEXT: pxor %xmm4, %xmm10
				; SSE2-NEXT: movdqa %xmm3, %xmm4
				; SSE2-NEXT: psrad $31, %xmm4
				; SSE2-NEXT: paddd %xmm4, %xmm3
				; SSE2-NEXT: pxor %xmm4, %xmm3
				; SSE2-NEXT: movdqa -{{[0-9]+}}(%rsp), %xmm2 # 16-byte Reload
				; SSE2-NEXT: movdqa %xmm2, %xmm4
				; SSE2-NEXT: psrad $31, %xmm4
				; SSE2-NEXT: paddd %xmm4, %xmm2
				; SSE2-NEXT: pxor %xmm4, %xmm2
				; SSE2-NEXT: movdqa %xmm2, -{{[0-9]+}}(%rsp) # 16-byte Spill
				; SSE2-NEXT: movdqa %xmm1, %xmm4
				; SSE2-NEXT: psrad $31, %xmm4
				; SSE2-NEXT: paddd %xmm4, %xmm1
				; SSE2-NEXT: pxor %xmm4, %xmm1
				; SSE2-NEXT: movdqa {{[0-9]+}}(%rsp), %xmm2 # 16-byte Reload
				; SSE2-NEXT: movdqa %xmm2, %xmm4
				; SSE2-NEXT: psrad $31, %xmm4
				; SSE2-NEXT: paddd %xmm4, %xmm2
				; SSE2-NEXT: pxor %xmm4, %xmm2
				; SSE2-NEXT: movdqa %xmm2, %xmm5
				; SSE2-NEXT: movdqa %xmm13, %xmm4
				; SSE2-NEXT: psrad $31, %xmm4
				; SSE2-NEXT: paddd %xmm4, %xmm13
				; SSE2-NEXT: pxor %xmm4, %xmm13
				; SSE2-NEXT: movdqa -{{[0-9]+}}(%rsp), %xmm2 # 16-byte Reload
				; SSE2-NEXT: paddd %xmm13, %xmm2
				; SSE2-NEXT: movdqa %xmm2, -{{[0-9]+}}(%rsp) # 16-byte Spill
				; SSE2-NEXT: movdqa {{[0-9]+}}(%rsp), %xmm12 # 16-byte Reload
				; SSE2-NEXT: movdqa {{[0-9]+}}(%rsp), %xmm6 # 16-byte Reload
				; SSE2-NEXT: paddd %xmm5, %xmm6
				; SSE2-NEXT: movdqa {{[0-9]+}}(%rsp), %xmm4 # 16-byte Reload
				; SSE2-NEXT: paddd %xmm1, %xmm4
				; SSE2-NEXT: movdqa %xmm4, {{[0-9]+}}(%rsp) # 16-byte Spill
				; SSE2-NEXT: movdqa {{[0-9]+}}(%rsp), %xmm2 # 16-byte Reload
				; SSE2-NEXT: movdqa {{[0-9]+}}(%rsp), %xmm5 # 16-byte Reload
				; SSE2-NEXT: paddd -{{[0-9]+}}(%rsp), %xmm5 # 16-byte Folded Reload
				; SSE2-NEXT: movdqa -{{[0-9]+}}(%rsp), %xmm4 # 16-byte Reload
				; SSE2-NEXT: paddd %xmm3, %xmm4
				; SSE2-NEXT: movdqa %xmm4, -{{[0-9]+}}(%rsp) # 16-byte Spill
				; SSE2-NEXT: movdqa {{[0-9]+}}(%rsp), %xmm3 # 16-byte Reload
				; SSE2-NEXT: movdqa {{[0-9]+}}(%rsp), %xmm4 # 16-byte Reload
				; SSE2-NEXT: paddd %xmm10, %xmm4
				; SSE2-NEXT: movdqa %xmm4, {{[0-9]+}}(%rsp) # 16-byte Spill
				; SSE2-NEXT: movdqa {{[0-9]+}}(%rsp), %xmm13 # 16-byte Reload
				; SSE2-NEXT: movdqa {{[0-9]+}}(%rsp), %xmm10 # 16-byte Reload
				; SSE2-NEXT: paddd -{{[0-9]+}}(%rsp), %xmm13 # 16-byte Folded Reload
				; SSE2-NEXT: paddd -{{[0-9]+}}(%rsp), %xmm3 # 16-byte Folded Reload
				; SSE2-NEXT: movdqa {{[0-9]+}}(%rsp), %xmm1 # 16-byte Reload
				; SSE2-NEXT: paddd %xmm15, %xmm1
				; SSE2-NEXT: movdqa {{[0-9]+}}(%rsp), %xmm15 # 16-byte Reload
				; SSE2-NEXT: paddd {{[0-9]+}}(%rsp), %xmm12 # 16-byte Folded Reload
				; SSE2-NEXT: movdqa -{{[0-9]+}}(%rsp), %xmm4 # 16-byte Reload
				; SSE2-NEXT: paddd %xmm14, %xmm4
				; SSE2-NEXT: movdqa %xmm4, -{{[0-9]+}}(%rsp) # 16-byte Spill
				; SSE2-NEXT: movdqa {{[0-9]+}}(%rsp), %xmm4 # 16-byte Reload
				; SSE2-NEXT: paddd %xmm11, %xmm4
				; SSE2-NEXT: movdqa %xmm4, {{[0-9]+}}(%rsp) # 16-byte Spill
				; SSE2-NEXT: movdqa {{[0-9]+}}(%rsp), %xmm11 # 16-byte Reload
				; SSE2-NEXT: paddd -{{[0-9]+}}(%rsp), %xmm15 # 16-byte Folded Reload
				; SSE2-NEXT: movdqa {{[0-9]+}}(%rsp), %xmm4 # 16-byte Reload
				; SSE2-NEXT: paddd %xmm9, %xmm4
				; SSE2-NEXT: movdqa %xmm4, {{[0-9]+}}(%rsp) # 16-byte Spill
				; SSE2-NEXT: movdqa {{[0-9]+}}(%rsp), %xmm9 # 16-byte Reload
				; SSE2-NEXT: movdqa -{{[0-9]+}}(%rsp), %xmm4 # 16-byte Reload
				; SSE2-NEXT: paddd %xmm7, %xmm4
				; SSE2-NEXT: movdqa %xmm4, -{{[0-9]+}}(%rsp) # 16-byte Spill
				; SSE2-NEXT: movdqa -{{[0-9]+}}(%rsp), %xmm7 # 16-byte Reload
				; SSE2-NEXT: movdqa (%rsp), %xmm4 # 16-byte Reload
				; SSE2-NEXT: paddd %xmm0, %xmm4
				; SSE2-NEXT: movdqa %xmm4, (%rsp) # 16-byte Spill
				; SSE2-NEXT: movdqa (%rsp), %xmm0 # 16-byte Reload
				; SSE2-NEXT: addq $4, %rax
				; SSE2-NEXT: jne .LBB2_1
				; SSE2-NEXT: # BB#2: # %middle.block
				; SSE2-NEXT: paddd -{{[0-9]+}}(%rsp), %xmm1 # 16-byte Folded Reload
				; SSE2-NEXT: paddd %xmm7, %xmm13
				; SSE2-NEXT: paddd -{{[0-9]+}}(%rsp), %xmm5 # 16-byte Folded Reload
				; SSE2-NEXT: paddd %xmm15, %xmm6
				; SSE2-NEXT: paddd %xmm11, %xmm3
				; SSE2-NEXT: paddd %xmm0, %xmm10
				; SSE2-NEXT: paddd %xmm12, %xmm2
				; SSE2-NEXT: movdqa -{{[0-9]+}}(%rsp), %xmm0 # 16-byte Reload
				; SSE2-NEXT: paddd %xmm9, %xmm0
				; SSE2-NEXT: paddd %xmm2, %xmm0
				; SSE2-NEXT: paddd %xmm3, %xmm10
				; SSE2-NEXT: paddd %xmm5, %xmm6
				; SSE2-NEXT: paddd %xmm1, %xmm13
				; SSE2-NEXT: paddd %xmm6, %xmm13
				; SSE2-NEXT: paddd %xmm0, %xmm10
				; SSE2-NEXT: paddd %xmm13, %xmm10
				; SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm10[2,3,0,1]
				; SSE2-NEXT: paddd %xmm10, %xmm0
				; SSE2-NEXT: pshufd {{.*#+}} xmm1 = xmm0[1,1,2,3]
				; SSE2-NEXT: paddd %xmm0, %xmm1
				; SSE2-NEXT: movd %xmm1, %eax
				; SSE2-NEXT: addq $232, %rsp
				; SSE2-NEXT: retq
				;
				; AVX2-LABEL: sad_avx64i8:
				; AVX2: # BB#0: # %entry
				; AVX2-NEXT: vpxor %ymm0, %ymm0, %ymm0
				; AVX2-NEXT: movq $-1024, %rax # imm = 0xFFFFFFFFFFFFFC00
				; AVX2-NEXT: vpxor %ymm2, %ymm2, %ymm2
				; AVX2-NEXT: vpxor %ymm1, %ymm1, %ymm1
				; AVX2-NEXT: vpxor %ymm3, %ymm3, %ymm3
				; AVX2-NEXT: vpxor %ymm4, %ymm4, %ymm4
				; AVX2-NEXT: vpxor %ymm6, %ymm6, %ymm6
				; AVX2-NEXT: vpxor %ymm5, %ymm5, %ymm5
				; AVX2-NEXT: vpxor %ymm7, %ymm7, %ymm7
				; AVX2-NEXT: .p2align 4, 0x90
				; AVX2-NEXT: .LBB2_1: # %vector.body
				; AVX2-NEXT: # =>This Inner Loop Header: Depth=1
				; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm8 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
				; AVX2-NEXT: vmovdqu %ymm8, -{{[0-9]+}}(%rsp) # 32-byte Spill
				; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm9 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
				; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm10 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
				; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm11 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
				; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm12 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
				; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm13 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
				; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm14 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
				; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm15 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
				; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm8 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
				; AVX2-NEXT: vpsubd %ymm8, %ymm15, %ymm8
				; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm15 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
				; AVX2-NEXT: vpsubd %ymm15, %ymm14, %ymm14
				; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm15 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
				; AVX2-NEXT: vpsubd %ymm15, %ymm13, %ymm13
				; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm15 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
				; AVX2-NEXT: vpsubd %ymm15, %ymm12, %ymm12
				; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm15 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
				; AVX2-NEXT: vpsubd %ymm15, %ymm11, %ymm11
				; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm15 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
				; AVX2-NEXT: vpsubd %ymm15, %ymm10, %ymm10
				; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm15 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
				; AVX2-NEXT: vpsubd %ymm15, %ymm9, %ymm9
				; AVX2-NEXT: vmovdqu %ymm9, -{{[0-9]+}}(%rsp) # 32-byte Spill
				; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm15 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
				; AVX2-NEXT: vmovdqu -{{[0-9]+}}(%rsp), %ymm9 # 32-byte Reload
				; AVX2-NEXT: vpsubd %ymm15, %ymm9, %ymm15
				; AVX2-NEXT: vpabsd %ymm8, %ymm8
				; AVX2-NEXT: vpaddd %ymm3, %ymm8, %ymm3
				; AVX2-NEXT: vpabsd %ymm14, %ymm8
				; AVX2-NEXT: vpaddd %ymm1, %ymm8, %ymm1
				; AVX2-NEXT: vpabsd %ymm13, %ymm8
				; AVX2-NEXT: vpaddd %ymm2, %ymm8, %ymm2
				; AVX2-NEXT: vpabsd %ymm12, %ymm8
				; AVX2-NEXT: vpaddd %ymm0, %ymm8, %ymm0
				; AVX2-NEXT: vpabsd %ymm11, %ymm8
				; AVX2-NEXT: vpaddd %ymm4, %ymm8, %ymm4
				; AVX2-NEXT: vpabsd %ymm10, %ymm8
				; AVX2-NEXT: vpaddd %ymm6, %ymm8, %ymm6
				; AVX2-NEXT: vpabsd -{{[0-9]+}}(%rsp), %ymm8 # 32-byte Folded Reload
				; AVX2-NEXT: vpaddd %ymm5, %ymm8, %ymm5
				; AVX2-NEXT: vpabsd %ymm15, %ymm8
				; AVX2-NEXT: vpaddd %ymm7, %ymm8, %ymm7
				; AVX2-NEXT: addq $4, %rax
				; AVX2-NEXT: jne .LBB2_1
				; AVX2-NEXT: # BB#2: # %middle.block
				; AVX2-NEXT: vpaddd %ymm6, %ymm2, %ymm2
				; AVX2-NEXT: vpaddd %ymm7, %ymm3, %ymm3
				; AVX2-NEXT: vpaddd %ymm4, %ymm0, %ymm0
				; AVX2-NEXT: vpaddd %ymm5, %ymm1, %ymm1
				; AVX2-NEXT: vpaddd %ymm1, %ymm0, %ymm0
				; AVX2-NEXT: vpaddd %ymm3, %ymm2, %ymm1
				; AVX2-NEXT: vpaddd %ymm1, %ymm0, %ymm0
				; AVX2-NEXT: vextracti128 $1, %ymm0, %xmm1
				; AVX2-NEXT: vpaddd %ymm1, %ymm0, %ymm0
				; AVX2-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]
				; AVX2-NEXT: vpaddd %ymm1, %ymm0, %ymm0
				; AVX2-NEXT: vphaddd %ymm0, %ymm0, %ymm0
				; AVX2-NEXT: vmovd %xmm0, %eax
				; AVX2-NEXT: vzeroupper
				; AVX2-NEXT: retq
				;
				; AVX512F-LABEL: sad_avx64i8:
				; AVX512F: # BB#0: # %entry
				; AVX512F-NEXT: vpxord %zmm0, %zmm0, %zmm0
				; AVX512F-NEXT: movq $-1024, %rax # imm = 0xFFFFFFFFFFFFFC00
				; AVX512F-NEXT: vpxord %zmm1, %zmm1, %zmm1
				; AVX512F-NEXT: vpxord %zmm2, %zmm2, %zmm2
				; AVX512F-NEXT: vpxord %zmm3, %zmm3, %zmm3
				; AVX512F-NEXT: .p2align 4, 0x90
				; AVX512F-NEXT: .LBB2_1: # %vector.body
				; AVX512F-NEXT: # =>This Inner Loop Header: Depth=1
				; AVX512F-NEXT: vpmovzxbd {{.*#+}} zmm4 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero,mem[8],zero,zero,zero,mem[9],zero,zero,zero,mem[10],zero,zero,zero,mem[11],zero,zero,zero,mem[12],zero,zero,zero,mem[13],zero,zero,zero,mem[14],zero,zero,zero,mem[15],zero,zero,zero
				; AVX512F-NEXT: vpmovzxbd {{.*#+}} zmm5 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero,mem[8],zero,zero,zero,mem[9],zero,zero,zero,mem[10],zero,zero,zero,mem[11],zero,zero,zero,mem[12],zero,zero,zero,mem[13],zero,zero,zero,mem[14],zero,zero,zero,mem[15],zero,zero,zero
				; AVX512F-NEXT: vpmovzxbd {{.*#+}} zmm6 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero,mem[8],zero,zero,zero,mem[9],zero,zero,zero,mem[10],zero,zero,zero,mem[11],zero,zero,zero,mem[12],zero,zero,zero,mem[13],zero,zero,zero,mem[14],zero,zero,zero,mem[15],zero,zero,zero
				; AVX512F-NEXT: vpmovzxbd {{.*#+}} zmm7 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero,mem[8],zero,zero,zero,mem[9],zero,zero,zero,mem[10],zero,zero,zero,mem[11],zero,zero,zero,mem[12],zero,zero,zero,mem[13],zero,zero,zero,mem[14],zero,zero,zero,mem[15],zero,zero,zero
				; AVX512F-NEXT: vpmovzxbd {{.*#+}} zmm8 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero,mem[8],zero,zero,zero,mem[9],zero,zero,zero,mem[10],zero,zero,zero,mem[11],zero,zero,zero,mem[12],zero,zero,zero,mem[13],zero,zero,zero,mem[14],zero,zero,zero,mem[15],zero,zero,zero
				; AVX512F-NEXT: vpmovzxbd {{.*#+}} zmm9 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero,mem[8],zero,zero,zero,mem[9],zero,zero,zero,mem[10],zero,zero,zero,mem[11],zero,zero,zero,mem[12],zero,zero,zero,mem[13],zero,zero,zero,mem[14],zero,zero,zero,mem[15],zero,zero,zero
				; AVX512F-NEXT: vpmovzxbd {{.*#+}} zmm10 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero,mem[8],zero,zero,zero,mem[9],zero,zero,zero,mem[10],zero,zero,zero,mem[11],zero,zero,zero,mem[12],zero,zero,zero,mem[13],zero,zero,zero,mem[14],zero,zero,zero,mem[15],zero,zero,zero
				; AVX512F-NEXT: vpmovzxbd {{.*#+}} zmm11 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero,mem[8],zero,zero,zero,mem[9],zero,zero,zero,mem[10],zero,zero,zero,mem[11],zero,zero,zero,mem[12],zero,zero,zero,mem[13],zero,zero,zero,mem[14],zero,zero,zero,mem[15],zero,zero,zero
				; AVX512F-NEXT: vpsubd %zmm11, %zmm7, %zmm7
				; AVX512F-NEXT: vpsubd %zmm10, %zmm6, %zmm6
				; AVX512F-NEXT: vpsubd %zmm9, %zmm5, %zmm5
				; AVX512F-NEXT: vpsubd %zmm8, %zmm4, %zmm4
				; AVX512F-NEXT: vpabsd %zmm4, %zmm4
				; AVX512F-NEXT: vpabsd %zmm5, %zmm5
				; AVX512F-NEXT: vpabsd %zmm6, %zmm6
				; AVX512F-NEXT: vpabsd %zmm7, %zmm7
				; AVX512F-NEXT: vpaddd %zmm3, %zmm7, %zmm3
				; AVX512F-NEXT: vpaddd %zmm2, %zmm6, %zmm2
				; AVX512F-NEXT: vpaddd %zmm1, %zmm5, %zmm1
				; AVX512F-NEXT: vpaddd %zmm0, %zmm4, %zmm0
				; AVX512F-NEXT: addq $4, %rax
				; AVX512F-NEXT: jne .LBB2_1
				; AVX512F-NEXT: # BB#2: # %middle.block
				; AVX512F-NEXT: vpaddd %zmm2, %zmm0, %zmm0
				; AVX512F-NEXT: vpaddd %zmm3, %zmm1, %zmm1
				; AVX512F-NEXT: vpaddd %zmm1, %zmm0, %zmm0
				; AVX512F-NEXT: vshufi64x2 {{.*#+}} zmm1 = zmm0[4,5,6,7,0,1,0,1]
				; AVX512F-NEXT: vpaddd %zmm1, %zmm0, %zmm0
				; AVX512F-NEXT: vshufi64x2 {{.*#+}} zmm1 = zmm0[2,3,0,1,0,1,0,1]
				; AVX512F-NEXT: vpaddd %zmm1, %zmm0, %zmm0
				; AVX512F-NEXT: vpunpckhqdq {{.*#+}} zmm1 = zmm0[1,1,3,3,5,5,7,7]
				; AVX512F-NEXT: vpaddd %zmm1, %zmm0, %zmm0
				; AVX512F-NEXT: movl $1, %eax
				; AVX512F-NEXT: vmovd %eax, %xmm1
				; AVX512F-NEXT: vpermd %zmm0, %zmm1, %zmm1
				; AVX512F-NEXT: vpaddd %zmm1, %zmm0, %zmm0
				; AVX512F-NEXT: vmovd %xmm0, %eax
				; AVX512F-NEXT: retq
				;
				; AVX512BW-LABEL: sad_avx64i8:
				; AVX512BW: # BB#0: # %entry
				; AVX512BW-NEXT: pushq %rbp
				; AVX512BW-NEXT: movq %rsp, %rbp
				; AVX512BW-NEXT: andq $-256, %rsp
				; AVX512BW-NEXT: subq $512, %rsp # imm = 0x200
				; AVX512BW-NEXT: vpxord %zmm0, %zmm0, %zmm0
				; AVX512BW-NEXT: movq $-1024, %rax # imm = 0xFFFFFFFFFFFFFC00
				; AVX512BW-NEXT: vpxord %zmm2, %zmm2, %zmm2
				; AVX512BW-NEXT: vpxord %zmm3, %zmm3, %zmm3
				; AVX512BW-NEXT: vpxord %zmm1, %zmm1, %zmm1
				; AVX512BW-NEXT: .p2align 4, 0x90
				; AVX512BW-NEXT: .LBB2_1: # %vector.body
				; AVX512BW-NEXT: # =>This Inner Loop Header: Depth=1
				; AVX512BW-NEXT: vmovdqu8 a+1024(%rax), %zmm4
				; AVX512BW-NEXT: vmovdqa32 %zmm0, (%rsp)
				; AVX512BW-NEXT: vmovdqa32 %zmm2, {{[0-9]+}}(%rsp)
				; AVX512BW-NEXT: vmovdqa32 %zmm3, {{[0-9]+}}(%rsp)
				; AVX512BW-NEXT: vmovdqa32 %zmm1, {{[0-9]+}}(%rsp)
				; AVX512BW-NEXT: vpsadbw b+1024(%rax), %zmm4, %zmm1
				; AVX512BW-NEXT: vpaddd %zmm0, %zmm1, %zmm0
				; AVX512BW-NEXT: vmovdqa32 %zmm0, (%rsp)
				; AVX512BW-NEXT: vmovdqa32 {{[0-9]+}}(%rsp), %zmm1
				; AVX512BW-NEXT: vmovdqa32 {{[0-9]+}}(%rsp), %zmm3
				; AVX512BW-NEXT: vmovdqa32 {{[0-9]+}}(%rsp), %zmm2
				; AVX512BW-NEXT: addq $4, %rax
				; AVX512BW-NEXT: jne .LBB2_1
				; AVX512BW-NEXT: # BB#2: # %middle.block
				; AVX512BW-NEXT: vpaddd %zmm3, %zmm0, %zmm0
				; AVX512BW-NEXT: vpaddd %zmm1, %zmm2, %zmm1
				; AVX512BW-NEXT: vpaddd %zmm1, %zmm0, %zmm0
				; AVX512BW-NEXT: vshufi64x2 {{.*#+}} zmm1 = zmm0[4,5,6,7,0,1,0,1]
				; AVX512BW-NEXT: vpaddd %zmm1, %zmm0, %zmm0
				; AVX512BW-NEXT: vshufi64x2 {{.*#+}} zmm1 = zmm0[2,3,0,1,0,1,0,1]
				; AVX512BW-NEXT: vpaddd %zmm1, %zmm0, %zmm0
				; AVX512BW-NEXT: vpunpckhqdq {{.*#+}} zmm1 = zmm0[1,1,3,3,5,5,7,7]
				; AVX512BW-NEXT: vpaddd %zmm1, %zmm0, %zmm0
				; AVX512BW-NEXT: movl $1, %eax
				; AVX512BW-NEXT: vmovd %eax, %xmm1
				; AVX512BW-NEXT: vpermd %zmm0, %zmm1, %zmm1
				; AVX512BW-NEXT: vpaddd %zmm1, %zmm0, %zmm0
				; AVX512BW-NEXT: vmovd %xmm0, %eax
				; AVX512BW-NEXT: movq %rbp, %rsp
				; AVX512BW-NEXT: popq %rbp
				; AVX512BW-NEXT: retq
				entry:
				br label %vector.body

				vector.body:
				%index = phi i64 [ 0, %entry ], [ %index.next, %vector.body ]
				%vec.phi = phi <64 x i32> [ zeroinitializer, %entry ], [ %10, %vector.body ]
				%0 = getelementptr inbounds [1024 x i8], [1024 x i8]* @a, i64 0, i64 %index
				%1 = bitcast i8* %0 to <64 x i8>*
				%wide.load = load <64 x i8>, <64 x i8>* %1, align 64
				%2 = zext <64 x i8> %wide.load to <64 x i32>
				%3 = getelementptr inbounds [1024 x i8], [1024 x i8]* @b, i64 0, i64 %index
				%4 = bitcast i8* %3 to <64 x i8>*
				%wide.load1 = load <64 x i8>, <64 x i8>* %4, align 64
				%5 = zext <64 x i8> %wide.load1 to <64 x i32>
				%6 = sub nsw <64 x i32> %2, %5
				%7 = icmp sgt <64 x i32> %6, <i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1>
				%8 = sub nsw <64 x i32> zeroinitializer, %6
				%9 = select <64 x i1> %7, <64 x i32> %6, <64 x i32> %8
				%10 = add nsw <64 x i32> %9, %vec.phi
				%index.next = add i64 %index, 4
				%11 = icmp eq i64 %index.next, 1024
				br i1 %11, label %middle.block, label %vector.body

				middle.block:
				%.lcssa = phi <64 x i32> [ %10, %vector.body ]
				%rdx.shuf = shufflevector <64 x i32> %.lcssa, <64 x i32> undef, <64 x i32> <i32 32, i32 33, i32 34, i32 35, i32 36, i32 37, i32 38, i32 39, i32 40, i32 41, i32 42, i32 43, i32 44, i32 45, i32 46, i32 47, i32 48, i32 49, i32 50, i32 51, i32 52, i32 53, i32 54, i32 55, i32 56, i32 57, i32 58, i32 59, i32 60, i32 61, i32 62, i32 63, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%bin.rdx = add <64 x i32> %.lcssa, %rdx.shuf
				%rdx.shuf2 = shufflevector <64 x i32> %bin.rdx, <64 x i32> undef, <64 x i32> <i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%bin.rdx2 = add <64 x i32> %bin.rdx, %rdx.shuf2
				%rdx.shuf3 = shufflevector <64 x i32> %bin.rdx2, <64 x i32> undef, <64 x i32> <i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%bin.rdx3 = add <64 x i32> %bin.rdx2, %rdx.shuf3
				%rdx.shuf4 = shufflevector <64 x i32> %bin.rdx3, <64 x i32> undef, <64 x i32> <i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%bin.rdx4 = add <64 x i32> %bin.rdx3, %rdx.shuf4
				%rdx.shuf5 = shufflevector <64 x i32> %bin.rdx4, <64 x i32> undef, <64 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%bin.rdx5 = add <64 x i32> %bin.rdx4, %rdx.shuf5
				%rdx.shuf6 = shufflevector <64 x i32> %bin.rdx5, <64 x i32> undef, <64 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%bin.rdx6 = add <64 x i32> %bin.rdx5, %rdx.shuf6
				%12 = extractelement <64 x i32> %bin.rdx6, i32 0
				ret i32 %12
				}
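For readers following the test above: the IR in `sad_avx64i8` zero-extends two byte vectors, subtracts them, takes the absolute value through the `icmp sgt` / `select` pair, and accumulates the result, which is the sum-of-absolute-differences (SAD) pattern that the AVX512BW output folds into `vpsadbw`. The following is a minimal scalar sketch of that pattern in C, not code from the patch. The function names `sad_scalar`, `psadbw_lane`, and `sad_by_lanes` are hypothetical and chosen only for illustration. The sketch assumes a simple one-byte-per-element walk over both arrays; the test IR itself advances `%index` by 4 per iteration, so this models the intended computation rather than the exact iteration space of the test.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Scalar reference: sum of |a[i] - b[i]| over n bytes.
 * Mirrors the zext -> sub -> abs -> add chain in the IR above. */
static uint32_t sad_scalar(const uint8_t *a, const uint8_t *b, size_t n) {
    uint32_t sum = 0;
    for (size_t i = 0; i < n; ++i) {
        int d = (int)a[i] - (int)b[i];      /* widen before subtracting */
        sum += (uint32_t)(d < 0 ? -d : d);  /* absolute difference */
    }
    return sum;
}

/* Model of a single 64-bit lane of psadbw: the absolute differences of
 * 8 byte pairs are summed into one result (at most 8 * 255 = 2040). */
static uint64_t psadbw_lane(const uint8_t *a, const uint8_t *b) {
    uint64_t s = 0;
    for (int i = 0; i < 8; ++i)
        s += (uint64_t)abs((int)a[i] - (int)b[i]);
    return s;
}

/* Hypothetical helper: accumulate per-lane results, as a psadbw-based
 * loop would, then narrow. Assumes n is a multiple of 8. */
static uint32_t sad_by_lanes(const uint8_t *a, const uint8_t *b, size_t n) {
    assert(n % 8 == 0);
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i += 8)
        sum += psadbw_lane(a + i, b + i);
    return (uint32_t)sum;
}
```

Under these assumptions both routines return the same total for any input whose length is a multiple of 8, which is why replacing the widened subtract/abs/add sequence with `psadbw` followed by an add is a valid transformation for this reduction.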