This is an archive of the discontinued LLVM Phabricator instance.

[X86] Match PSADBW in straight-line code
ClosedPublic

Authored by mkuper on Jul 27 2016, 3:34 PM.

Details

Reviewers

spatel
RKSimon
wmi

Commits

rGf396b4c40dad: [X86] Match PSADBW in straight-line code
rL277219: [X86] Match PSADBW in straight-line code

Summary

Up until now, we only had code to match PSADBW patterns similar to what comes out of the loop vectorizer - a partial reduction inside the loop body that gets fed into a horizontal operation in a different block.
This adds support for straight-line patterns, like those generated by the SLP vectorizer.
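
For reference, the quantity being matched is a sum of absolute differences over unsigned bytes. The scalar sketch below is only an illustration of that quantity; `sadBytes` is a hypothetical name and does not appear in the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scalar model of a sum of absolute differences over unsigned bytes.
// `sadBytes` is an illustrative name, not part of LLVM.
inline uint32_t sadBytes(const uint8_t *A, const uint8_t *B, size_t N) {
  uint32_t Sum = 0;
  for (size_t I = 0; I < N; ++I) {
    // Widen before subtracting so the difference cannot wrap around.
    int32_t Diff = int32_t(A[I]) - int32_t(B[I]);
    Sum += uint32_t(Diff < 0 ? -Diff : Diff);
  }
  return Sum;
}
```

The widening step corresponds to the `zext` from i8 to i32 that appears in the test IR of this revision.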

Event Timeline

mkuper updated this revision to Diff 65819. Jul 27 2016, 3:34 PM

mkuper retitled this revision to [X86] Match PSADBW in straight-line code.

mkuper updated this object.

mkuper added reviewers: spatel, RKSimon, wmi.

mkuper added a subscriber: llvm-commits.

wmi added a subscriber: danielcdh. Jul 27 2016, 4:07 PM

RKSimon added inline comments.

Is there much scope to share more of the code with combineLoopSADPattern?

lib/Target/X86/X86ISelLowering.cpp
26461	To support wider types is there any way that you can split the vector, perform PSAD on both and then combine the 2 results?
26500	Minor, but move this into the for() loop's initializer?
26529	Move into for loop

mkuper added inline comments.

Thanks, Simon!

I tried to share as much code as I could, see https://reviews.llvm.org/rL276798 and https://reviews.llvm.org/rL276918 .
I don't see a way to share the "shuffle pyramid" detection code, unfortunately - for the loop case, we are forced to do it in IR, because the pyramid lives in a different basic block.
And the type checks in the beginning didn't look like they were worth it.

If you have suggestions for what I can factor out, let me know, I'll be happy to try.

lib/Target/X86/X86ISelLowering.cpp
26461	I think so. Right now we don't do this for the loop version either (I didn't write that code originally :-) ), and I kept the same constraint here. I'd prefer to leave it as a TODO, in both places.
26500	Sure (for both loops)
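
To make the width constraint under discussion concrete: the patch compares `VT.getSizeInBits() / 4` against `RegSize`, because the vNi32 value being reduced is four times wider than the vNi8 data that PSADBW actually consumes. The sketch below just restates that arithmetic; `fitsInOneSadRegister` is a hypothetical helper, not LLVM API.

```cpp
#include <cassert>

// Hypothetical helper mirroring the patch's width check. The reduced vector
// has i32 elements, but PSADBW reads the original i8 data, which is 4x
// narrower, hence the division by 4.
constexpr bool fitsInOneSadRegister(unsigned NumI32Elems,
                                    unsigned RegSizeBits) {
  return (NumI32Elems * 32u) / 4u <= RegSizeBits;
}

// The three limits named in the patch's comment.
static_assert(fitsInOneSadRegister(16, 128), "v16i32 with SSE2");
static_assert(fitsInOneSadRegister(32, 256), "v32i32 with AVX2");
static_assert(fitsInOneSadRegister(64, 512), "v64i32 with AVX512BW");
// One step wider than the register is rejected.
static_assert(!fitsInOneSadRegister(32, 128), "v32i32 is too wide for SSE2");
```

The last assertion corresponds to the `sad_nonloop_32i8` test, where the SSE2 output contains no `psadbw` at all.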

RKSimon added inline comments.

Another few minor thoughts, I'm happy if you'd prefer to just make these all future TODOs.

lib/Target/X86/X86ISelLowering.cpp
26450	Doesn't PSADBW generate unsigned i16 results? IIRC the horizontal sum of absdiff v16i8 / v32i8 / v64i8 should fit into a single i16 correct? Really there's nothing stopping us supporting any result integer type >= 16-bits no?
26461	OK - a TODO comment is fine.
26481	This looks like it could be useful for general matching of reduction/horizontal ops - possibly pull it out? I have in mind detecting any_of/all_of tests for vector comparison results that I'd like to use MOVMSK for instead - in that case it'd be the same code but we'd match against OR / AND instead of ADD.
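
The range argument in the first comment can be checked with simple arithmetic: each byte difference is at most 255, so even 64 byte pairs sum to at most 64 * 255 = 16320, which is below the unsigned 16-bit limit of 65535. A small sketch, using the hypothetical name `maxSadSum`:

```cpp
#include <cassert>
#include <cstdint>

// Largest possible sum of absolute differences over NumBytes byte pairs:
// each |a - b| with a, b in [0, 255] is at most 255.
constexpr uint32_t maxSadSum(uint32_t NumBytes) { return NumBytes * 255u; }

static_assert(maxSadSum(16) == 4080, "v16i8");
static_assert(maxSadSum(32) == 8160, "v32i8");
static_assert(maxSadSum(64) == 16320, "v64i8");
static_assert(maxSadSum(64) <= UINT16_MAX, "fits in an unsigned i16");
```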

mkuper added inline comments. Jul 29 2016, 10:09 AM

lib/Target/X86/X86ISelLowering.cpp
26450	You're right, it's another constraint the original loop code had, and removing it seems trivial. I'll add a TODO and fix this in a separate patch.
26481	Sure, I'll pull it out. TBH, I wouldn't be surprised if we already had something similar, but I didn't see it. There's isHorizontalBinOp(), but it's different - it matches the x86 "horizontal" operations (like PHADDD).
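
As a rough illustration of what a factored-out matcher could look like, the sketch below walks a shuffle + binop pyramid with the binary opcode as a parameter, so the same walk could serve ADD, OR or AND. It uses toy stand-ins (`Opc`, `Node`, `matchReductionPyramid`) rather than SelectionDAG types; all of these names are hypothetical, and this is not the helper that was eventually committed.

```cpp
#include <cassert>
#include <vector>

// Toy stand-ins for SelectionDAG nodes; none of these types are LLVM API.
enum class Opc { Leaf, Add, Or, And, Shuffle };

struct Node {
  Opc Opcode = Opc::Leaf;
  const Node *Ops[2] = {nullptr, nullptr}; // operands (Shuffle uses Ops[0])
  std::vector<int> Mask;                   // shuffle mask, -1 means undef
};

// Walks up from Root through `Stages` levels of (BinOp X, (Shuffle X, mask))
// and returns X at the top of the pyramid, or nullptr on mismatch.
// Stage 0 is the level nearest the final extract.
inline const Node *matchReductionPyramid(const Node *Root, Opc BinOp,
                                         unsigned Stages) {
  const Node *Op = Root;
  for (unsigned I = 0; I < Stages; ++I) {
    if (!Op || Op->Opcode != BinOp)
      return nullptr;
    // The shuffle may be either operand of the binary op.
    const Node *Shuffle = Op->Ops[0];
    const Node *Other = Op->Ops[1];
    if (!Shuffle || Shuffle->Opcode != Opc::Shuffle) {
      Shuffle = Op->Ops[1];
      Other = Op->Ops[0];
    }
    if (!Shuffle || Shuffle->Opcode != Opc::Shuffle ||
        Shuffle->Ops[0] != Other)
      return nullptr;
    // Lanes [0, MaskEnd) must read lanes [MaskEnd, 2 * MaskEnd).
    unsigned MaskEnd = 1u << I;
    if (Shuffle->Mask.size() < MaskEnd)
      return nullptr;
    for (unsigned Index = 0; Index < MaskEnd; ++Index)
      if (Shuffle->Mask[Index] != int(MaskEnd + Index))
        return nullptr;
    Op = Other;
  }
  return Op;
}
```

Like the loop in the patch, this sketch only checks the low `MaskEnd` lanes of each mask and ignores the remaining (normally undef) lanes.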

Updated per Simon's comments.

LGTM

This revision is now accepted and ready to land. Jul 29 2016, 2:28 PM

Closed by commit rL277219: [X86] Match PSADBW in straight-line code (authored by mkuper). Jul 29 2016, 2:53 PM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path                                  Size
lib/Target/X86/X86ISelLowering.cpp    118 lines
test/CodeGen/X86/sad.ll               309 lines

Diff 65819

lib/Target/X86/X86ISelLowering.cpp

[26,432 lines not shown; context: static SDValue createPSADBW(SelectionDAG &DAG, const SDValue &Zext0,]
   Ops[0] = Zext1.getOperand(0);
   SDValue SadOp1 = DAG.getNode(ISD::CONCAT_VECTORS, DL, ExtendedVT, Ops);

   // Actually build the SAD
   MVT SadVT = MVT::getVectorVT(MVT::i64, RegSize / 64);
   return DAG.getNode(X86ISD::PSADBW, DL, SadVT, SadOp0, SadOp1);
 }

+static SDValue combineBasicSADPattern(SDNode *Extract, SelectionDAG &DAG,
+                                      const X86Subtarget &Subtarget) {
+  // PSADBW is only supported on SSE2 and up.
+  if (!Subtarget.hasSSE2())
+    return SDValue();
+
+  // Verify the type we're extracting from is appropriate
+  EVT VT = Extract->getOperand(0).getValueType();
+  if (!VT.isSimple() || !(VT.getVectorElementType() == MVT::i32))
+    return SDValue();
[Inline comment, RKSimon] Doesn't PSADBW generate unsigned i16 results? IIRC the horizontal sum of absdiff v16i8 / v32i8 / v64i8 should fit into a single i16 correct? Really there's nothing stopping us supporting any result integer type >= 16-bits no?
[Inline comment, mkuper] You're right, it's another constraint the original loop code had, and removing it seems trivial. I'll add a TODO and fix this in a separate patch.

+  unsigned RegSize = 128;
+  if (Subtarget.hasBWI())
+    RegSize = 512;
+  else if (Subtarget.hasAVX2())
+    RegSize = 256;
+
+  // We only handle v16i32 for SSE2 / v32i32 for AVX2 / v64i32 for AVX512.
+  if (VT.getSizeInBits() / 4 > RegSize)
+    return SDValue();
+
[Inline comment, RKSimon] To support wider types is there any way that you can split the vector, perform PSAD on both and then combine the 2 results?
[Inline comment, mkuper] I think so. Right now we don't do this for the loop version either (I didn't write that code originally :-) ), and I kept the same constraint here. I'd prefer to leave it as a TODO, in both places.
[Inline comment, RKSimon] OK - a TODO comment is fine.
+  // Verify the extract is from index 0.
+  if (!isNullConstant(Extract->getOperand(1)))
+    return SDValue();

+  // Match shuffle + add pyramid.
+  unsigned Elems = VT.getVectorNumElements();
+  unsigned Stages = Log2_32(Elems);
+
+  SDValue Op = Extract->getOperand(0);
+  // At each stage, we're looking for something that looks like:
+  // %s = shufflevector <8 x i32> %op, <8 x i32> undef,
+  //                    <8 x i32> <i32 2, i32 3, i32 undef, i32 undef,
+  //                               i32 undef, i32 undef, i32 undef, i32 undef>
+  // %a = add <8 x i32> %op, %s
+  // Where the mask changes according to the stage. E.g. for a 3-stage pyramid,
+  // we expect something like:
+  // <4,5,6,7,u,u,u,u>
+  // <2,3,u,u,u,u,u,u>
+  // <1,u,u,u,u,u,u,u>
+  for (unsigned i = 0; i < Stages; ++i) {
[Inline comment, RKSimon] This looks like it could be useful for general matching of reduction/horizontal ops - possibly pull it out? I have in mind detecting any_of/all_of tests for vector comparison results that I'd like to use MOVMSK for instead - in that case it'd be the same code but we'd match against OR / AND instead of ADD.
[Inline comment, mkuper] Sure, I'll pull it out. TBH, I wouldn't be surprised if we already had something similar, but I didn't see it. There's isHorizontalBinOp(), but it's different - it matches the x86 "horizontal" operations (like PHADDD).
+    if (Op.getOpcode() != ISD::ADD)
+      return SDValue();
+
+    ShuffleVectorSDNode *Shuffle =
+        dyn_cast<ShuffleVectorSDNode>(Op.getOperand(0).getNode());
+    if (Shuffle) {
+      Op = Op.getOperand(1);
+    } else {
+      Shuffle = dyn_cast<ShuffleVectorSDNode>(Op.getOperand(1).getNode());
+      Op = Op.getOperand(0);
+    }
+
+    // The first operand of the shuffle should be the same as the other operand
+    // of the add.
+    if (!Shuffle || (Shuffle->getOperand(0) != Op))
+      return SDValue();
+
+    // Verify the shuffle has the expected (at this stage of the pyramid) mask.
+    int MaskEnd = 1 << i;
[Inline comment, RKSimon] Minor, but move this into the for() loop's initializer?
[Inline comment, mkuper] Sure (for both loops)
+    for (int Index = 0; Index < MaskEnd; ++Index)
+      if (Shuffle->getMaskElt(Index) != MaskEnd + Index)
+        return SDValue();
+  }

+  // At this point, we expect Op to be a select that is the root of an
+  // abs-diff pattern.
+  if (Op.getOpcode() != ISD::VSELECT)
+    return SDValue();
+
+  // Check whether we have an abs-diff pattern feeding into the select.
+  SDValue Zext0, Zext1;
+  if (!detectZextAbsDiff(Op, Zext0, Zext1))
+    return SDValue();
+
+  // Create the SAD instruction
+  SDLoc DL(Extract);
+  SDValue SAD = createPSADBW(DAG, Zext0, Zext1, DL);
+
+  // If the original vector was wider than 8 elements, sum over the results
+  // in the SAD vector.
+  MVT SadVT = SAD.getSimpleValueType();
+  if (Stages > 3) {
+    unsigned SadElems = SadVT.getVectorNumElements();
+
+    for(unsigned i = Stages - 3; i > 0; --i) {
+      SmallVector<int, 16> Mask(SadElems, -1);
+      unsigned MaskEnd = 1 << (i - 1);
+      for(unsigned j = 0; j < MaskEnd; ++j)
[Inline comment, RKSimon] Move into for loop
+        Mask[j] = MaskEnd + j;
+
+      SDValue Shuffle =
+          DAG.getVectorShuffle(SadVT, DL, SAD, DAG.getUNDEF(SadVT), Mask);
+      SAD = DAG.getNode(ISD::ADD, DL, SadVT, SAD, Shuffle);
+    }
+  }
+
+
+  // Return the lowest i32.
+  MVT ResVT = MVT::getVectorVT(MVT::i32, SadVT.getSizeInBits() / 32);
+  SAD = DAG.getNode(ISD::BITCAST, DL, ResVT, SAD);
+  return DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, MVT::i32, SAD,
+                     Extract->getOperand(1));
+}
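
Both loops in `combineBasicSADPattern` rely on the same halving scheme: at each stage the upper half of the live lanes is shuffled down and added to the lower half, so that lane 0 ends up holding the total. The sketch below simulates that on plain integers and also restates why only `Stages - 3` stages remain after PSADBW (each i64 result lane already sums 8 bytes). `reduceByPyramid` and `remainingStages` are hypothetical names used for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Simulates the log2(N)-stage shuffle + add pyramid on N lanes (N a power of
// two). After the last stage, lane 0 holds the sum of all lanes.
inline uint32_t reduceByPyramid(std::vector<uint32_t> V) {
  assert(!V.empty() && (V.size() & (V.size() - 1)) == 0 && "power of two");
  for (size_t Half = V.size() / 2; Half >= 1; Half /= 2)
    for (size_t I = 0; I < Half; ++I)
      V[I] += V[Half + I]; // lane I reads lane Half + I, as in the mask
  return V[0];
}

// Number of add stages still needed after PSADBW for NumBytes input bytes:
// floor(log2(NumBytes)) total stages, minus the 3 that PSADBW performs
// within each 8-byte group.
inline unsigned remainingStages(unsigned NumBytes) {
  assert(NumBytes > 0);
  unsigned Stages = 0;
  while ((NumBytes >> (Stages + 1)) != 0)
    ++Stages;
  return Stages > 3 ? Stages - 3 : 0;
}
```

For 16 and 32 input bytes this gives one and two remaining stages, which agrees with the number of `paddq` instructions in the AVX2 checks for `sad_nonloop_16i8` and `sad_nonloop_32i8`.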

 /// Detect vector gather/scatter index generation and convert it from being a
 /// bunch of shuffles and extracts into a somewhat faster sequence.
 /// For i686, the best sequence is apparently storing the value and loading
 /// scalars back, while for x64 we should use 64-bit extracts and shifts.
 static SDValue combineExtractVectorElt(SDNode *N, SelectionDAG &DAG,
-                                       TargetLowering::DAGCombinerInfo &DCI) {
+                                       TargetLowering::DAGCombinerInfo &DCI,
+                                       const X86Subtarget &Subtarget) {
   if (SDValue NewOp = XFormVExtractWithShuffleIntoLoad(N, DAG, DCI))
     return NewOp;

   SDValue InputVector = N->getOperand(0);
   SDLoc dl(InputVector);
   // Detect mmx to i32 conversion through a v2i32 elt extract.
   if (InputVector.getOpcode() == ISD::BITCAST && InputVector.hasOneUse() &&
       N->getValueType(0) == MVT::i32 &&
[14 lines not shown; context: if (VT == MVT::i1 && isa<ConstantSDNode>(N->getOperand(1)) &&]
       isa<ConstantSDNode>(InputVector.getOperand(0))) {
     uint64_t ExtractedElt =
         cast<ConstantSDNode>(N->getOperand(1))->getZExtValue();
     uint64_t InputValue =
         cast<ConstantSDNode>(InputVector.getOperand(0))->getZExtValue();
     uint64_t Res = (InputValue >> ExtractedElt) & 1;
     return DAG.getConstant(Res, dl, MVT::i1);
   }

+  // Check whether this extract is the root of a sum of absolute differences
+  // pattern. This has to be done here because we really want it to happen
+  // pre-legalization,
+  if (SDValue SAD = combineBasicSADPattern(N, DAG, Subtarget))
+    return SAD;
+
   // Only operate on vectors of 4 elements, where the alternative shuffling
   // gets to be more expensive.
   if (InputVector.getValueType() != MVT::v4i32)
     return SDValue();

   // Check whether every use of InputVector is an EXTRACT_VECTOR_ELT with a
   // single use which is a sign-extend or zero-extend, and all elements are
   // used.
[4,525 lines not shown]
 }


 SDValue X86TargetLowering::PerformDAGCombine(SDNode *N,
                                              DAGCombinerInfo &DCI) const {
   SelectionDAG &DAG = DCI.DAG;
   switch (N->getOpcode()) {
   default: break;
-  case ISD::EXTRACT_VECTOR_ELT: return combineExtractVectorElt(N, DAG, DCI);
+  case ISD::EXTRACT_VECTOR_ELT:
+    return combineExtractVectorElt(N, DAG, DCI, Subtarget);
   case ISD::VSELECT:
   case ISD::SELECT:
   case X86ISD::SHRUNKBLEND: return combineSelect(N, DAG, DCI, Subtarget);
   case ISD::BITCAST: return combineBitcast(N, DAG, Subtarget);
   case X86ISD::CMOV: return combineCMov(N, DAG, DCI, Subtarget);
   case ISD::ADD: return combineAdd(N, DAG, Subtarget);
   case ISD::SUB: return combineSub(N, DAG, Subtarget);
   case X86ISD::ADC: return combineADC(N, DAG, DCI);
[969 lines not shown]

test/CodeGen/X86/sad.ll

 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; NOTE: Assertions have been autogenerated by update_llc_test_checks.py
 ; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+sse2 | FileCheck %s --check-prefix=SSE2
 ; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx2 | FileCheck %s --check-prefix=AVX2
 ; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx512f | FileCheck %s --check-prefix=AVX512F
 ; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx512bw | FileCheck %s --check-prefix=AVX512BW

 @a = global [1024 x i8] zeroinitializer, align 16
 @b = global [1024 x i8] zeroinitializer, align 16

[983 lines not shown]
 middle.block:
   %.lcssa = phi <2 x i32> [ %10, %vector.body ]
   %rdx.shuf = shufflevector <2 x i32> %.lcssa, <2 x i32> undef, <2 x i32> <i32 1, i32 undef>
   %bin.rdx = add <2 x i32> %.lcssa, %rdx.shuf
   %12 = extractelement <2 x i32> %bin.rdx, i32 0
   ret i32 %12
 }

				define i32 @sad_nonloop_4i8(<4 x i8>* nocapture readonly %p, i64, <4 x i8>* nocapture readonly %q) local_unnamed_addr #0 {
				; SSE2-LABEL: sad_nonloop_4i8:
				; SSE2: # BB#0:
				; SSE2-NEXT: movd {{.*#+}} xmm0 = mem[0],zero,zero,zero
				; SSE2-NEXT: movd {{.*#+}} xmm1 = mem[0],zero,zero,zero
				; SSE2-NEXT: psadbw %xmm0, %xmm1
				; SSE2-NEXT: movd %xmm1, %eax
				; SSE2-NEXT: retq
				;
				; AVX2-LABEL: sad_nonloop_4i8:
				; AVX2: # BB#0:
				; AVX2-NEXT: vmovd {{.*#+}} xmm0 = mem[0],zero,zero,zero
				; AVX2-NEXT: vmovd {{.*#+}} xmm1 = mem[0],zero,zero,zero
				; AVX2-NEXT: vpsadbw %xmm0, %xmm1, %xmm0
				; AVX2-NEXT: vmovd %xmm0, %eax
				; AVX2-NEXT: retq
				;
				; AVX512F-LABEL: sad_nonloop_4i8:
				; AVX512F: # BB#0:
				; AVX512F-NEXT: vmovd {{.*#+}} xmm0 = mem[0],zero,zero,zero
				; AVX512F-NEXT: vmovd {{.*#+}} xmm1 = mem[0],zero,zero,zero
				; AVX512F-NEXT: vpsadbw %xmm0, %xmm1, %xmm0
				; AVX512F-NEXT: vmovd %xmm0, %eax
				; AVX512F-NEXT: retq
				;
				; AVX512BW-LABEL: sad_nonloop_4i8:
				; AVX512BW: # BB#0:
				; AVX512BW-NEXT: vmovd {{.*#+}} xmm0 = mem[0],zero,zero,zero
				; AVX512BW-NEXT: vmovd {{.*#+}} xmm1 = mem[0],zero,zero,zero
				; AVX512BW-NEXT: vpsadbw %xmm0, %xmm1, %xmm0
				; AVX512BW-NEXT: vmovd %xmm0, %eax
				; AVX512BW-NEXT: retq
				%v1 = load <4 x i8>, <4 x i8>* %p, align 1
				%z1 = zext <4 x i8> %v1 to <4 x i32>
				%v2 = load <4 x i8>, <4 x i8>* %q, align 1
				%z2 = zext <4 x i8> %v2 to <4 x i32>
				%sub = sub nsw <4 x i32> %z1, %z2
				%isneg = icmp sgt <4 x i32> %sub, <i32 -1, i32 -1, i32 -1, i32 -1>
				%neg = sub nsw <4 x i32> zeroinitializer, %sub
				%abs = select <4 x i1> %isneg, <4 x i32> %sub, <4 x i32> %neg
				%h2 = shufflevector <4 x i32> %abs, <4 x i32> undef, <4 x i32> <i32 2, i32 3, i32 undef, i32 undef>
				%sum2 = add <4 x i32> %abs, %h2
				%h3 = shufflevector <4 x i32> %sum2, <4 x i32> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
				%sum3 = add <4 x i32> %sum2, %h3
				%sum = extractelement <4 x i32> %sum3, i32 0
				ret i32 %sum
				}

				define i32 @sad_nonloop_8i8(<8 x i8>* nocapture readonly %p, i64, <8 x i8>* nocapture readonly %q) local_unnamed_addr #0 {
				; SSE2-LABEL: sad_nonloop_8i8:
				; SSE2: # BB#0:
				; SSE2-NEXT: movq {{.*#+}} xmm0 = mem[0],zero
				; SSE2-NEXT: movq {{.*#+}} xmm1 = mem[0],zero
				; SSE2-NEXT: psadbw %xmm0, %xmm1
				; SSE2-NEXT: movd %xmm1, %eax
				; SSE2-NEXT: retq
				;
				; AVX2-LABEL: sad_nonloop_8i8:
				; AVX2: # BB#0:
				; AVX2-NEXT: vmovq {{.*#+}} xmm0 = mem[0],zero
				; AVX2-NEXT: vmovq {{.*#+}} xmm1 = mem[0],zero
				; AVX2-NEXT: vpsadbw %xmm0, %xmm1, %xmm0
				; AVX2-NEXT: vmovd %xmm0, %eax
				; AVX2-NEXT: retq
				;
				; AVX512F-LABEL: sad_nonloop_8i8:
				; AVX512F: # BB#0:
				; AVX512F-NEXT: vmovq {{.*#+}} xmm0 = mem[0],zero
				; AVX512F-NEXT: vmovq {{.*#+}} xmm1 = mem[0],zero
				; AVX512F-NEXT: vpsadbw %xmm0, %xmm1, %xmm0
				; AVX512F-NEXT: vmovd %xmm0, %eax
				; AVX512F-NEXT: retq
				;
				; AVX512BW-LABEL: sad_nonloop_8i8:
				; AVX512BW: # BB#0:
				; AVX512BW-NEXT: vmovq {{.*#+}} xmm0 = mem[0],zero
				; AVX512BW-NEXT: vmovq {{.*#+}} xmm1 = mem[0],zero
				; AVX512BW-NEXT: vpsadbw %xmm0, %xmm1, %xmm0
				; AVX512BW-NEXT: vmovd %xmm0, %eax
				; AVX512BW-NEXT: retq
				%v1 = load <8 x i8>, <8 x i8>* %p, align 1
				%z1 = zext <8 x i8> %v1 to <8 x i32>
				%v2 = load <8 x i8>, <8 x i8>* %q, align 1
				%z2 = zext <8 x i8> %v2 to <8 x i32>
				%sub = sub nsw <8 x i32> %z1, %z2
				%isneg = icmp sgt <8 x i32> %sub, <i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1>
				%neg = sub nsw <8 x i32> zeroinitializer, %sub
				%abs = select <8 x i1> %isneg, <8 x i32> %sub, <8 x i32> %neg
				%h1 = shufflevector <8 x i32> %abs, <8 x i32> undef, <8 x i32> <i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef>
				%sum1 = add <8 x i32> %abs, %h1
				%h2 = shufflevector <8 x i32> %sum1, <8 x i32> undef, <8 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%sum2 = add <8 x i32> %sum1, %h2
				%h3 = shufflevector <8 x i32> %sum2, <8 x i32> undef, <8 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%sum3 = add <8 x i32> %sum2, %h3
				%sum = extractelement <8 x i32> %sum3, i32 0
				ret i32 %sum
				}

				define i32 @sad_nonloop_16i8(<16 x i8>* nocapture readonly %p, i64, <16 x i8>* nocapture readonly %q) local_unnamed_addr #0 {
				; SSE2-LABEL: sad_nonloop_16i8:
				; SSE2: # BB#0:
				; SSE2-NEXT: movdqu (%rdi), %xmm0
				; SSE2-NEXT: movdqu (%rdx), %xmm1
				; SSE2-NEXT: psadbw %xmm0, %xmm1
				; SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm1[2,3,0,1]
				; SSE2-NEXT: paddq %xmm1, %xmm0
				; SSE2-NEXT: movd %xmm0, %eax
				; SSE2-NEXT: retq
				;
				; AVX2-LABEL: sad_nonloop_16i8:
				; AVX2: # BB#0:
				; AVX2-NEXT: vmovdqu (%rdi), %xmm0
				; AVX2-NEXT: vpsadbw (%rdx), %xmm0, %xmm0
				; AVX2-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]
				; AVX2-NEXT: vpaddq %xmm1, %xmm0, %xmm0
				; AVX2-NEXT: vmovd %xmm0, %eax
				; AVX2-NEXT: retq
				;
				; AVX512F-LABEL: sad_nonloop_16i8:
				; AVX512F: # BB#0:
				; AVX512F-NEXT: vmovdqu (%rdi), %xmm0
				; AVX512F-NEXT: vpsadbw (%rdx), %xmm0, %xmm0
				; AVX512F-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]
				; AVX512F-NEXT: vpaddq %xmm1, %xmm0, %xmm0
				; AVX512F-NEXT: vmovd %xmm0, %eax
				; AVX512F-NEXT: retq
				;
				; AVX512BW-LABEL: sad_nonloop_16i8:
				; AVX512BW: # BB#0:
				; AVX512BW-NEXT: vmovdqu (%rdi), %xmm0
				; AVX512BW-NEXT: vpsadbw (%rdx), %xmm0, %xmm0
				; AVX512BW-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]
				; AVX512BW-NEXT: vpaddq %xmm1, %xmm0, %xmm0
				; AVX512BW-NEXT: vmovd %xmm0, %eax
				; AVX512BW-NEXT: retq
				%v1 = load <16 x i8>, <16 x i8>* %p, align 1
				%z1 = zext <16 x i8> %v1 to <16 x i32>
				%v2 = load <16 x i8>, <16 x i8>* %q, align 1
				%z2 = zext <16 x i8> %v2 to <16 x i32>
				%sub = sub nsw <16 x i32> %z1, %z2
				%isneg = icmp sgt <16 x i32> %sub, <i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1>
				%neg = sub nsw <16 x i32> zeroinitializer, %sub
				%abs = select <16 x i1> %isneg, <16 x i32> %sub, <16 x i32> %neg
				%h0 = shufflevector <16 x i32> %abs, <16 x i32> undef, <16 x i32> <i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%sum0 = add <16 x i32> %abs, %h0
				%h1 = shufflevector <16 x i32> %sum0, <16 x i32> undef, <16 x i32> <i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%sum1 = add <16 x i32> %sum0, %h1
				%h2 = shufflevector <16 x i32> %sum1, <16 x i32> undef, <16 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%sum2 = add <16 x i32> %sum1, %h2
				%h3 = shufflevector <16 x i32> %sum2, <16 x i32> undef, <16 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%sum3 = add <16 x i32> %sum2, %h3
				%sum = extractelement <16 x i32> %sum3, i32 0
				ret i32 %sum
				}

				define i32 @sad_nonloop_32i8(<32 x i8>* nocapture readonly %p, i64, <32 x i8>* nocapture readonly %q) local_unnamed_addr #0 {
				; SSE2-LABEL: sad_nonloop_32i8:
				; SSE2: # BB#0:
				; SSE2-NEXT: movdqu (%rdi), %xmm12
				; SSE2-NEXT: movdqu 16(%rdi), %xmm2
				; SSE2-NEXT: pshufd {{.*#+}} xmm13 = xmm2[2,3,0,1]
				; SSE2-NEXT: pxor %xmm5, %xmm5
				; SSE2-NEXT: punpcklbw {{.*#+}} xmm13 = xmm13[0],xmm5[0],xmm13[1],xmm5[1],xmm13[2],xmm5[2],xmm13[3],xmm5[3],xmm13[4],xmm5[4],xmm13[5],xmm5[5],xmm13[6],xmm5[6],xmm13[7],xmm5[7]
				; SSE2-NEXT: movdqa %xmm13, %xmm9
				; SSE2-NEXT: punpckhwd {{.*#+}} xmm9 = xmm9[4],xmm5[4],xmm9[5],xmm5[5],xmm9[6],xmm5[6],xmm9[7],xmm5[7]
				; SSE2-NEXT: pshufd {{.*#+}} xmm1 = xmm12[2,3,0,1]
				; SSE2-NEXT: punpcklbw {{.*#+}} xmm1 = xmm1[0],xmm5[0],xmm1[1],xmm5[1],xmm1[2],xmm5[2],xmm1[3],xmm5[3],xmm1[4],xmm5[4],xmm1[5],xmm5[5],xmm1[6],xmm5[6],xmm1[7],xmm5[7]
				; SSE2-NEXT: movdqa %xmm1, %xmm3
				; SSE2-NEXT: punpckhwd {{.*#+}} xmm3 = xmm3[4],xmm5[4],xmm3[5],xmm5[5],xmm3[6],xmm5[6],xmm3[7],xmm5[7]
				; SSE2-NEXT: punpcklbw {{.*#+}} xmm2 = xmm2[0],xmm5[0],xmm2[1],xmm5[1],xmm2[2],xmm5[2],xmm2[3],xmm5[3],xmm2[4],xmm5[4],xmm2[5],xmm5[5],xmm2[6],xmm5[6],xmm2[7],xmm5[7]
				; SSE2-NEXT: movdqa %xmm2, %xmm10
				; SSE2-NEXT: punpckhwd {{.*#+}} xmm10 = xmm10[4],xmm5[4],xmm10[5],xmm5[5],xmm10[6],xmm5[6],xmm10[7],xmm5[7]
				; SSE2-NEXT: punpcklbw {{.*#+}} xmm12 = xmm12[0],xmm5[0],xmm12[1],xmm5[1],xmm12[2],xmm5[2],xmm12[3],xmm5[3],xmm12[4],xmm5[4],xmm12[5],xmm5[5],xmm12[6],xmm5[6],xmm12[7],xmm5[7]
				; SSE2-NEXT: movdqa %xmm12, %xmm11
				; SSE2-NEXT: punpckhwd {{.*#+}} xmm11 = xmm11[4],xmm5[4],xmm11[5],xmm5[5],xmm11[6],xmm5[6],xmm11[7],xmm5[7]
				; SSE2-NEXT: punpcklwd {{.*#+}} xmm13 = xmm13[0],xmm5[0],xmm13[1],xmm5[1],xmm13[2],xmm5[2],xmm13[3],xmm5[3]
				; SSE2-NEXT: punpcklwd {{.*#+}} xmm1 = xmm1[0],xmm5[0],xmm1[1],xmm5[1],xmm1[2],xmm5[2],xmm1[3],xmm5[3]
				; SSE2-NEXT: punpcklwd {{.*#+}} xmm2 = xmm2[0],xmm5[0],xmm2[1],xmm5[1],xmm2[2],xmm5[2],xmm2[3],xmm5[3]
				; SSE2-NEXT: punpcklwd {{.*#+}} xmm12 = xmm12[0],xmm5[0],xmm12[1],xmm5[1],xmm12[2],xmm5[2],xmm12[3],xmm5[3]
				; SSE2-NEXT: movdqu (%rdx), %xmm7
				; SSE2-NEXT: movdqu 16(%rdx), %xmm0
				; SSE2-NEXT: pshufd {{.*#+}} xmm6 = xmm0[2,3,0,1]
				; SSE2-NEXT: punpcklbw {{.*#+}} xmm6 = xmm6[0],xmm5[0],xmm6[1],xmm5[1],xmm6[2],xmm5[2],xmm6[3],xmm5[3],xmm6[4],xmm5[4],xmm6[5],xmm5[5],xmm6[6],xmm5[6],xmm6[7],xmm5[7]
				; SSE2-NEXT: movdqa %xmm6, %xmm4
				; SSE2-NEXT: punpckhwd {{.*#+}} xmm4 = xmm4[4],xmm5[4],xmm4[5],xmm5[5],xmm4[6],xmm5[6],xmm4[7],xmm5[7]
				; SSE2-NEXT: movdqa %xmm4, -{{[0-9]+}}(%rsp) # 16-byte Spill
				; SSE2-NEXT: pshufd {{.*#+}} xmm4 = xmm7[2,3,0,1]
				; SSE2-NEXT: punpcklbw {{.*#+}} xmm4 = xmm4[0],xmm5[0],xmm4[1],xmm5[1],xmm4[2],xmm5[2],xmm4[3],xmm5[3],xmm4[4],xmm5[4],xmm4[5],xmm5[5],xmm4[6],xmm5[6],xmm4[7],xmm5[7]
				; SSE2-NEXT: movdqa %xmm4, %xmm14
				; SSE2-NEXT: punpckhwd {{.*#+}} xmm14 = xmm14[4],xmm5[4],xmm14[5],xmm5[5],xmm14[6],xmm5[6],xmm14[7],xmm5[7]
				; SSE2-NEXT: punpcklbw {{.*#+}} xmm0 = xmm0[0],xmm5[0],xmm0[1],xmm5[1],xmm0[2],xmm5[2],xmm0[3],xmm5[3],xmm0[4],xmm5[4],xmm0[5],xmm5[5],xmm0[6],xmm5[6],xmm0[7],xmm5[7]
				; SSE2-NEXT: movdqa %xmm0, %xmm15
				; SSE2-NEXT: punpckhwd {{.*#+}} xmm15 = xmm15[4],xmm5[4],xmm15[5],xmm5[5],xmm15[6],xmm5[6],xmm15[7],xmm5[7]
				; SSE2-NEXT: punpcklbw {{.*#+}} xmm7 = xmm7[0],xmm5[0],xmm7[1],xmm5[1],xmm7[2],xmm5[2],xmm7[3],xmm5[3],xmm7[4],xmm5[4],xmm7[5],xmm5[5],xmm7[6],xmm5[6],xmm7[7],xmm5[7]
				; SSE2-NEXT: movdqa %xmm7, %xmm8
				; SSE2-NEXT: punpckhwd {{.*#+}} xmm8 = xmm8[4],xmm5[4],xmm8[5],xmm5[5],xmm8[6],xmm5[6],xmm8[7],xmm5[7]
				; SSE2-NEXT: punpcklwd {{.*#+}} xmm6 = xmm6[0],xmm5[0],xmm6[1],xmm5[1],xmm6[2],xmm5[2],xmm6[3],xmm5[3]
				; SSE2-NEXT: punpcklwd {{.*#+}} xmm4 = xmm4[0],xmm5[0],xmm4[1],xmm5[1],xmm4[2],xmm5[2],xmm4[3],xmm5[3]
				; SSE2-NEXT: punpcklwd {{.*#+}} xmm0 = xmm0[0],xmm5[0],xmm0[1],xmm5[1],xmm0[2],xmm5[2],xmm0[3],xmm5[3]
				; SSE2-NEXT: punpcklwd {{.*#+}} xmm7 = xmm7[0],xmm5[0],xmm7[1],xmm5[1],xmm7[2],xmm5[2],xmm7[3],xmm5[3]
				; SSE2-NEXT: psubd %xmm7, %xmm12
				; SSE2-NEXT: psubd %xmm0, %xmm2
				; SSE2-NEXT: psubd %xmm4, %xmm1
				; SSE2-NEXT: psubd %xmm6, %xmm13
				; SSE2-NEXT: psubd %xmm8, %xmm11
				; SSE2-NEXT: psubd %xmm15, %xmm10
				; SSE2-NEXT: psubd %xmm14, %xmm3
				; SSE2-NEXT: psubd -{{[0-9]+}}(%rsp), %xmm9 # 16-byte Folded Reload
				; SSE2-NEXT: movdqa %xmm9, %xmm0
				; SSE2-NEXT: psrad $31, %xmm0
				; SSE2-NEXT: paddd %xmm0, %xmm9
				; SSE2-NEXT: pxor %xmm0, %xmm9
				; SSE2-NEXT: movdqa %xmm3, %xmm0
				; SSE2-NEXT: psrad $31, %xmm0
				; SSE2-NEXT: paddd %xmm0, %xmm3
				; SSE2-NEXT: pxor %xmm0, %xmm3
				; SSE2-NEXT: movdqa %xmm10, %xmm0
				; SSE2-NEXT: psrad $31, %xmm0
				; SSE2-NEXT: paddd %xmm0, %xmm10
				; SSE2-NEXT: pxor %xmm0, %xmm10
				; SSE2-NEXT: movdqa %xmm11, %xmm0
				; SSE2-NEXT: psrad $31, %xmm0
				; SSE2-NEXT: paddd %xmm0, %xmm11
				; SSE2-NEXT: pxor %xmm0, %xmm11
				; SSE2-NEXT: movdqa %xmm13, %xmm0
				; SSE2-NEXT: psrad $31, %xmm0
				; SSE2-NEXT: paddd %xmm0, %xmm13
				; SSE2-NEXT: pxor %xmm0, %xmm13
				; SSE2-NEXT: movdqa %xmm1, %xmm0
				; SSE2-NEXT: psrad $31, %xmm0
				; SSE2-NEXT: paddd %xmm0, %xmm1
				; SSE2-NEXT: pxor %xmm0, %xmm1
				; SSE2-NEXT: movdqa %xmm2, %xmm0
				; SSE2-NEXT: psrad $31, %xmm0
				; SSE2-NEXT: paddd %xmm0, %xmm2
				; SSE2-NEXT: pxor %xmm0, %xmm2
				; SSE2-NEXT: movdqa %xmm12, %xmm0
				; SSE2-NEXT: psrad $31, %xmm0
				; SSE2-NEXT: paddd %xmm0, %xmm12
				; SSE2-NEXT: pxor %xmm0, %xmm12
				; SSE2-NEXT: paddd %xmm13, %xmm1
				; SSE2-NEXT: paddd %xmm9, %xmm3
				; SSE2-NEXT: paddd %xmm10, %xmm3
				; SSE2-NEXT: paddd %xmm11, %xmm3
				; SSE2-NEXT: paddd %xmm2, %xmm1
				; SSE2-NEXT: paddd %xmm3, %xmm1
				; SSE2-NEXT: paddd %xmm12, %xmm1
				; SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm1[2,3,0,1]
				; SSE2-NEXT: paddd %xmm1, %xmm0
				; SSE2-NEXT: pshufd {{.*#+}} xmm1 = xmm0[1,1,2,3]
				; SSE2-NEXT: paddd %xmm0, %xmm1
				; SSE2-NEXT: movd %xmm1, %eax
				; SSE2-NEXT: retq
				;
				; AVX2-LABEL: sad_nonloop_32i8:
				; AVX2: # BB#0:
				; AVX2-NEXT: vmovdqu (%rdi), %ymm0
				; AVX2-NEXT: vpsadbw (%rdx), %ymm0, %ymm0
				; AVX2-NEXT: vextracti128 $1, %ymm0, %xmm1
				; AVX2-NEXT: vpaddq %ymm1, %ymm0, %ymm0
				; AVX2-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]
				; AVX2-NEXT: vpaddq %ymm1, %ymm0, %ymm0
				; AVX2-NEXT: vmovd %xmm0, %eax
				; AVX2-NEXT: vzeroupper
				; AVX2-NEXT: retq
				;
				; AVX512F-LABEL: sad_nonloop_32i8:
				; AVX512F: # BB#0:
				; AVX512F-NEXT: vmovdqu (%rdi), %ymm0
				; AVX512F-NEXT: vpsadbw (%rdx), %ymm0, %ymm0
				; AVX512F-NEXT: vextracti128 $1, %ymm0, %xmm1
				; AVX512F-NEXT: vpaddq %ymm1, %ymm0, %ymm0
				; AVX512F-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]
				; AVX512F-NEXT: vpaddq %ymm1, %ymm0, %ymm0
				; AVX512F-NEXT: vmovd %xmm0, %eax
				; AVX512F-NEXT: retq
				;
				; AVX512BW-LABEL: sad_nonloop_32i8:
				; AVX512BW: # BB#0:
				; AVX512BW-NEXT: vmovdqu (%rdi), %ymm0
				; AVX512BW-NEXT: vpsadbw (%rdx), %ymm0, %ymm0
				; AVX512BW-NEXT: vextracti128 $1, %ymm0, %xmm1
				; AVX512BW-NEXT: vpaddq %ymm1, %ymm0, %ymm0
				; AVX512BW-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]
				; AVX512BW-NEXT: vpaddq %ymm1, %ymm0, %ymm0
				; AVX512BW-NEXT: vmovd %xmm0, %eax
				; AVX512BW-NEXT: retq
				%v1 = load <32 x i8>, <32 x i8>* %p, align 1
				%z1 = zext <32 x i8> %v1 to <32 x i32>
				%v2 = load <32 x i8>, <32 x i8>* %q, align 1
				%z2 = zext <32 x i8> %v2 to <32 x i32>
				%sub = sub nsw <32 x i32> %z1, %z2
				%isneg = icmp sgt <32 x i32> %sub, <i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1>
				%neg = sub nsw <32 x i32> zeroinitializer, %sub
				%abs = select <32 x i1> %isneg, <32 x i32> %sub, <32 x i32> %neg
				%h32 = shufflevector <32 x i32> %abs, <32 x i32> undef, <32 x i32> <i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%sum32 = add <32 x i32> %abs, %h32
				%h0 = shufflevector <32 x i32> %sum32, <32 x i32> undef, <32 x i32> <i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%sum0 = add <32 x i32> %sum32, %h0
				%h1 = shufflevector <32 x i32> %sum0, <32 x i32> undef, <32 x i32> <i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%sum1 = add <32 x i32> %sum0, %h1
				%h2 = shufflevector <32 x i32> %sum1, <32 x i32> undef, <32 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%sum2 = add <32 x i32> %sum1, %h2
				%h3 = shufflevector <32 x i32> %sum2, <32 x i32> undef, <32 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%sum3 = add <32 x i32> %sum2, %h3
				%sum = extractelement <32 x i32> %sum3, i32 0
				ret i32 %sum
				}
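
The narrow cases (`sad_nonloop_4i8`, `sad_nonloop_8i8`) need no adds after `psadbw` because the loads zero the unused bytes, and zero bytes contribute nothing to the sum. The scalar model below sketches the 128-bit instruction's behaviour as I understand it from the Intel documentation (eight byte differences summed into each 64-bit lane); `Vec128` and `psadbw128` are hypothetical names used only for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Vec128 = std::array<uint8_t, 16>; // illustrative 128-bit byte vector

// Scalar model of 128-bit PSADBW: each 64-bit result lane holds the sum of
// absolute differences of its eight byte pairs (at most 8 * 255 = 2040).
inline std::array<uint64_t, 2> psadbw128(const Vec128 &A, const Vec128 &B) {
  std::array<uint64_t, 2> Result = {0, 0};
  for (unsigned I = 0; I < 16; ++I) {
    int Diff = int(A[I]) - int(B[I]);
    Result[I / 8] += uint64_t(Diff < 0 ? -Diff : Diff);
  }
  return Result;
}
```

With only the low four bytes populated, lane 0 already holds the full result, which is consistent with the `movd`/`psadbw`/`movd` sequence in the 4i8 checks. For 16 bytes the two lanes still have to be added together, which matches the single `pshufd` + `paddq` pair in the `sad_nonloop_16i8` checks.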