This is an archive of the discontinued LLVM Phabricator instance.

Differential D41618

[x86] allow pairs of PCMPEQ for vector-sized integer equality comparisons (PR33325)
ClosedPublic

Authored by spatel on Dec 28 2017, 2:24 PM.

Details

Reviewers

zvi
courbet
RKSimon

Commits

rG9a80871ffe65: [x86] allow pairs of PCMPEQ for vector-sized integer equality comparisons…
rL321656: [x86] allow pairs of PCMPEQ for vector-sized integer equality comparisons…

Summary

This is an extension of D31156 with the goal that we'll allow memcmp() == 0 expansion for x86 to use 2 pairs of loads per block.

The memcmp expansion pass (formerly part of CGP) will generate this kind of pattern with oversized integer compares, so we want to transform these into x86-specific vector nodes before legalization splits things into scalar chunks.

See PR33325 for more details:
https://bugs.llvm.org/show_bug.cgi?id=33325
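As a rough illustration of the transform the summary describes, the C++ sketch below models a 32-byte `memcmp(a, b, 32) == 0` in two shapes: the scalar or/xor reduction compared against zero, and the vector form built from two byte-equality compares that are and-ed together before the mask is extracted. It is a portable emulation and not LLVM code. `Block16`, `loadBlock`, `equal32Scalar`, and `equal32Vector` are hypothetical names, and `pcmpeqb`, `pand`, and `pmovmskb` are ordinary functions that only imitate the SSE2 instructions of the same names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for one 128-bit XMM register.
using Block16 = std::array<std::uint8_t, 16>;

// Models an unaligned 16-byte load (movdqu).
inline Block16 loadBlock(const unsigned char *p) {
  Block16 b;
  std::memcpy(b.data(), p, b.size());
  return b;
}

// Models PCMPEQB: 0xFF in every byte lane where x and y match, else 0x00.
inline Block16 pcmpeqb(const Block16 &x, const Block16 &y) {
  Block16 r;
  for (std::size_t i = 0; i < r.size(); ++i)
    r[i] = (x[i] == y[i]) ? 0xFF : 0x00;
  return r;
}

// Models PAND: bytewise 'and' of two registers.
inline Block16 pand(const Block16 &x, const Block16 &y) {
  Block16 r;
  for (std::size_t i = 0; i < r.size(); ++i)
    r[i] = static_cast<std::uint8_t>(x[i] & y[i]);
  return r;
}

// Models PMOVMSKB: collects the top bit of each byte into a 16-bit mask.
inline std::uint32_t pmovmskb(const Block16 &x) {
  std::uint32_t mask = 0;
  for (std::size_t i = 0; i < x.size(); ++i)
    mask |= static_cast<std::uint32_t>(x[i] >> 7) << i;
  return mask;
}

// Scalar shape: or (xor A, B), (xor C, D) compared with zero,
// done here at byte granularity over all 32 bytes.
inline bool equal32Scalar(const unsigned char *a, const unsigned char *b) {
  assert(a && b);
  std::uint8_t acc = 0;
  for (std::size_t i = 0; i < 32; ++i)
    acc = static_cast<std::uint8_t>(acc | (a[i] ^ b[i]));
  return acc == 0;
}

// Vector shape: two equality compares, 'and' the results, then test
// whether every byte lane matched (mask == 0xFFFF).
inline bool equal32Vector(const unsigned char *a, const unsigned char *b) {
  assert(a && b);
  Block16 cmp1 = pcmpeqb(loadBlock(a), loadBlock(b));
  Block16 cmp2 = pcmpeqb(loadBlock(a + 16), loadBlock(b + 16));
  return pmovmskb(pand(cmp1, cmp2)) == 0xFFFF;
}
```

Both functions should agree for any pair of 32-byte buffers; the vector shape simply replaces the wide or/xor chain with a mask comparison against all-ones.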

Diff Detail

Repository: rL LLVM

Event Timeline

spatel created this revision. Dec 28 2017, 2:24 PM

Herald added a subscriber: mcrosier. Dec 28 2017, 2:24 PM

RKSimon added inline comments. Dec 29 2017, 2:31 PM

test/CodeGen/X86/setcc-wide-types.ll
3 ↗	(On Diff #128305)	Please can you add AVX1/AVX512F/AVX512BW test cases to prove it's not doing anything dodgy with those?

spatel mentioned this in rL321624: [x86] add runs for more vector variants; NFC. Jan 1 2018, 8:38 AM

Patch updated:

Fixed the check for 'IsOrXorXor'; we must confirm that the 2nd operand ('Y') of the setcc is actually a zero.
Added runs / checks for more ISA variations to show that we're behaving as expected in those cases. Note that we do not expect to produce 'i256' types via memcmp expansion for anything before AVX2, so I don't think we care about optimizing those cases (the last 2 tests just show the legal scalar code in that scenario).
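The ISA gating behind that note can be sketched as a small predicate. This is a simplified model of the `OpSize`/`Subtarget` condition in the patch, not the LLVM code itself: `IsaFeatures` and `vectorCompareBits` are hypothetical names, and the two flags stand in for `Subtarget.hasSSE2()` and `Subtarget.hasAVX2()`.

```cpp
#include <cassert>

// Hypothetical stand-in for the two subtarget queries used by the patch.
struct IsaFeatures {
  bool hasSSE2;
  bool hasAVX2;
};

// Returns the vector width (in bits) used for an oversized integer equality
// compare, or 0 when the compare is left for legalization to split into
// scalar chunks.
inline unsigned vectorCompareBits(unsigned opSizeBits, IsaFeatures f) {
  assert(opSizeBits % 8 == 0 && "expected a whole number of bytes");
  if (opSizeBits == 128 && f.hasSSE2)
    return 128;
  if (opSizeBits == 256 && f.hasAVX2)
    return 256;
  return 0;
}
```

Under this model an i256 compare on an SSE2-only or AVX1-only target returns 0, which corresponds to the scalar code shown for those runs in the i256 pair tests.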

cc @craig.topper and @hfinkel -- This might be an interesting case for the x86 preferred vector width effort (D41341) even for 128/256-bit vectors. Here, we're taking scalar code that is or would be produced by memcmp expansion and translating it to vectors based on the available ISA, but without accounting for the preferred vector width. I think we should add that predicate as a follow-up to prevent producing vector code if the user has requested we avoid it. There's no AVX512-specific codegen here or in memcmp expansion, so that's not a danger (yet).

courbet added inline comments. Jan 1 2018, 11:34 PM

lib/Target/X86/X86ISelLowering.cpp
36308 ↗	(On Diff #128389)	I would add a comment to explain when this is typically generated, else this feels a bit magic. Something along the lines of: "This pattern is typically generated by the memcmp expansion pass with oversized integer compares (see PR33325)."
36314 ↗	(On Diff #128389)	The `isNullConstant(Y)` check is duplicated here with the definition of `IsOrXorXor`. Let's keep it inside `IsOrXorXor` (and maybe rename it to `IsOrXorXorCCZero`?).
36332 ↗	(On Diff #128389)	`eq`: this is a bit misleading on the first read. Change to `eq|ne` or `cc`?

spatel marked 3 inline comments as done. Jan 2 2018, 7:14 AM

spatel added inline comments.

lib/Target/X86/X86ISelLowering.cpp
36314 ↗	(On Diff #128389)	I agree with improving the variable name, but I don't see how we can simplify the logic unless we repeat the OrXorXor checks? We have: A && !(A && B) --> A && (!A || !B) --> A && !B, where A is isNullConstant and B is OrXorXor.
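The Boolean reduction in that comment can be checked exhaustively. The sketch below is only a truth-table check with hypothetical function names; `a` stands for "Y is a null constant" and `b` for "X is an `or` of two `xor` nodes".

```cpp
#include <cassert>

// Bail-out condition as written in the patch: A && !(A && B).
inline bool bailAsWritten(bool a, bool b) { return a && !(a && b); }

// Reduced form from the review comment: A && !B.
inline bool bailSimplified(bool a, bool b) { return a && !b; }

// Walks all four input combinations and asserts that both forms agree.
inline void checkFormsAgree() {
  for (int a = 0; a <= 1; ++a)
    for (int b = 0; b <= 1; ++b)
      assert(bailAsWritten(a != 0, b != 0) == bailSimplified(a != 0, b != 0));
}
```

The two forms are equivalent, but writing the reduced form directly in the patch would mean spelling out the or/xor/xor opcode checks a second time, which is the repetition the comment wants to avoid.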

Patch updated:
No functional difference from previous draft, but improved code comments and variable name.

courbet added inline comments. Jan 2 2018, 7:19 AM

lib/Target/X86/X86ISelLowering.cpp
36314 ↗	(On Diff #128389)	Never mind; I misread the condition. Let's just fix the variable name.

courbet accepted this revision. Jan 2 2018, 8:20 AM

This revision is now accepted and ready to land. Jan 2 2018, 8:20 AM

Closed by commit rL321656: [x86] allow pairs of PCMPEQ for vector-sized integer equality comparisons… (authored by spatel). Jan 2 2018, 8:39 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
llvm/trunk/lib/Target/X86/X86ISelLowering.cpp	38 lines
llvm/trunk/test/CodeGen/X86/setcc-wide-types.ll	284 lines

Diff 128423

llvm/trunk/lib/Target/X86/X86ISelLowering.cpp

	/// Try to map a 128-bit or larger integer comparison to vector instructions			/// Try to map a 128-bit or larger integer comparison to vector instructions
	/// before type legalization splits it up into chunks.			/// before type legalization splits it up into chunks.
	static SDValue combineVectorSizedSetCCEquality(SDNode *SetCC, SelectionDAG &DAG,			static SDValue combineVectorSizedSetCCEquality(SDNode *SetCC, SelectionDAG &DAG,
	const X86Subtarget &Subtarget) {			const X86Subtarget &Subtarget) {
	ISD::CondCode CC = cast<CondCodeSDNode>(SetCC->getOperand(2))->get();			ISD::CondCode CC = cast<CondCodeSDNode>(SetCC->getOperand(2))->get();
	assert((CC == ISD::SETNE || CC == ISD::SETEQ) && "Bad comparison predicate");			assert((CC == ISD::SETNE || CC == ISD::SETEQ) && "Bad comparison predicate");

	// We're looking for an oversized integer equality comparison, but ignore a			// We're looking for an oversized integer equality comparison.
	// comparison with zero because that gets special treatment in EmitTest().
	SDValue X = SetCC->getOperand(0);			SDValue X = SetCC->getOperand(0);
	SDValue Y = SetCC->getOperand(1);			SDValue Y = SetCC->getOperand(1);
	EVT OpVT = X.getValueType();			EVT OpVT = X.getValueType();
	unsigned OpSize = OpVT.getSizeInBits();			unsigned OpSize = OpVT.getSizeInBits();
	if (!OpVT.isScalarInteger() || OpSize < 128 || isNullConstant(Y))			if (!OpVT.isScalarInteger() || OpSize < 128)
				return SDValue();

				// Ignore a comparison with zero because that gets special treatment in
				// EmitTest(). But make an exception for the special case of a pair of
				// logically-combined vector-sized operands compared to zero. This pattern may
				// be generated by the memcmp expansion pass with oversized integer compares
				// (see PR33325).
				bool IsOrXorXorCCZero = isNullConstant(Y) && X.getOpcode() == ISD::OR &&
				X.getOperand(0).getOpcode() == ISD::XOR &&
				X.getOperand(1).getOpcode() == ISD::XOR;
				if (isNullConstant(Y) && !IsOrXorXorCCZero)
	return SDValue();			return SDValue();

	// Bail out if we know that this is not really just an oversized integer.			// Bail out if we know that this is not really just an oversized integer.
	if (peekThroughBitcasts(X).getValueType() == MVT::f128 ||			if (peekThroughBitcasts(X).getValueType() == MVT::f128 ||
	peekThroughBitcasts(Y).getValueType() == MVT::f128)			peekThroughBitcasts(Y).getValueType() == MVT::f128)
	return SDValue();			return SDValue();

	// TODO: Use PXOR + PTEST for SSE4.1 or later?			// TODO: Use PXOR + PTEST for SSE4.1 or later?
	// TODO: Add support for AVX-512.			// TODO: Add support for AVX-512.
	EVT VT = SetCC->getValueType(0);			EVT VT = SetCC->getValueType(0);
	SDLoc DL(SetCC);			SDLoc DL(SetCC);
	if ((OpSize == 128 && Subtarget.hasSSE2()) ||			if ((OpSize == 128 && Subtarget.hasSSE2()) ||
	(OpSize == 256 && Subtarget.hasAVX2())) {			(OpSize == 256 && Subtarget.hasAVX2())) {
	EVT VecVT = OpSize == 128 ? MVT::v16i8 : MVT::v32i8;			EVT VecVT = OpSize == 128 ? MVT::v16i8 : MVT::v32i8;
				SDValue Cmp;
				if (IsOrXorXorCCZero) {
				// This is a bitwise-combined equality comparison of 2 pairs of vectors:
				// setcc i128 (or (xor A, B), (xor C, D)), 0, eq|ne
				// Use 2 vector equality compares and 'and' the results before doing a
				// MOVMSK.
				SDValue A = DAG.getBitcast(VecVT, X.getOperand(0).getOperand(0));
				SDValue B = DAG.getBitcast(VecVT, X.getOperand(0).getOperand(1));
				SDValue C = DAG.getBitcast(VecVT, X.getOperand(1).getOperand(0));
				SDValue D = DAG.getBitcast(VecVT, X.getOperand(1).getOperand(1));
				SDValue Cmp1 = DAG.getNode(X86ISD::PCMPEQ, DL, VecVT, A, B);
				SDValue Cmp2 = DAG.getNode(X86ISD::PCMPEQ, DL, VecVT, C, D);
				Cmp = DAG.getNode(ISD::AND, DL, VecVT, Cmp1, Cmp2);
				} else {
	SDValue VecX = DAG.getBitcast(VecVT, X);			SDValue VecX = DAG.getBitcast(VecVT, X);
	SDValue VecY = DAG.getBitcast(VecVT, Y);			SDValue VecY = DAG.getBitcast(VecVT, Y);
				Cmp = DAG.getNode(X86ISD::PCMPEQ, DL, VecVT, VecX, VecY);
				}
	// If all bytes match (bitmask is 0x(FFFF)FFFF), that's equality.			// If all bytes match (bitmask is 0x(FFFF)FFFF), that's equality.
	// setcc i128 X, Y, eq --> setcc (pmovmskb (pcmpeqb X, Y)), 0xFFFF, eq			// setcc i128 X, Y, eq --> setcc (pmovmskb (pcmpeqb X, Y)), 0xFFFF, eq
	// setcc i128 X, Y, ne --> setcc (pmovmskb (pcmpeqb X, Y)), 0xFFFF, ne			// setcc i128 X, Y, ne --> setcc (pmovmskb (pcmpeqb X, Y)), 0xFFFF, ne
	// setcc i256 X, Y, eq --> setcc (vpmovmskb (vpcmpeqb X, Y)), 0xFFFFFFFF, eq			// setcc i256 X, Y, eq --> setcc (vpmovmskb (vpcmpeqb X, Y)), 0xFFFFFFFF, eq
	// setcc i256 X, Y, ne --> setcc (vpmovmskb (vpcmpeqb X, Y)), 0xFFFFFFFF, ne			// setcc i256 X, Y, ne --> setcc (vpmovmskb (vpcmpeqb X, Y)), 0xFFFFFFFF, ne
	SDValue Cmp = DAG.getNode(X86ISD::PCMPEQ, DL, VecVT, VecX, VecY);
	SDValue MovMsk = DAG.getNode(X86ISD::MOVMSK, DL, MVT::i32, Cmp);			SDValue MovMsk = DAG.getNode(X86ISD::MOVMSK, DL, MVT::i32, Cmp);
	SDValue FFFFs = DAG.getConstant(OpSize == 128 ? 0xFFFF : 0xFFFFFFFF, DL,			SDValue FFFFs = DAG.getConstant(OpSize == 128 ? 0xFFFF : 0xFFFFFFFF, DL,
	MVT::i32);			MVT::i32);
	return DAG.getSetCC(DL, VT, MovMsk, FFFFs, CC);			return DAG.getSetCC(DL, VT, MovMsk, FFFFs, CC);
	}			}

	return SDValue();			return SDValue();
	}			}

llvm/trunk/test/CodeGen/X86/setcc-wide-types.ll

		; AVX256-NEXT: retq
%zext = zext i1 %cmp to i32		%zext = zext i1 %cmp to i32
ret i32 %zext		ret i32 %zext
}		}

; This test models the expansion of 'memcmp(a, b, 32) != 0'		; This test models the expansion of 'memcmp(a, b, 32) != 0'
; if we allowed 2 pairs of 16-byte loads per block.		; if we allowed 2 pairs of 16-byte loads per block.

define i32 @ne_i128_pair(i128* %a, i128* %b) {		define i32 @ne_i128_pair(i128* %a, i128* %b) {
; ANY-LABEL: ne_i128_pair:		; SSE2-LABEL: ne_i128_pair:
; ANY: # %bb.0:		; SSE2: # %bb.0:
; ANY-NEXT: movq (%rdi), %rax		; SSE2-NEXT: movdqu (%rdi), %xmm0
; ANY-NEXT: movq 8(%rdi), %rcx		; SSE2-NEXT: movdqu 16(%rdi), %xmm1
; ANY-NEXT: xorq (%rsi), %rax		; SSE2-NEXT: movdqu (%rsi), %xmm2
; ANY-NEXT: xorq 8(%rsi), %rcx		; SSE2-NEXT: pcmpeqb %xmm0, %xmm2
; ANY-NEXT: movq 24(%rdi), %rdx		; SSE2-NEXT: movdqu 16(%rsi), %xmm0
; ANY-NEXT: movq 16(%rdi), %rdi		; SSE2-NEXT: pcmpeqb %xmm1, %xmm0
; ANY-NEXT: xorq 16(%rsi), %rdi		; SSE2-NEXT: pand %xmm2, %xmm0
; ANY-NEXT: orq %rax, %rdi		; SSE2-NEXT: pmovmskb %xmm0, %ecx
; ANY-NEXT: xorq 24(%rsi), %rdx		; SSE2-NEXT: xorl %eax, %eax
; ANY-NEXT: orq %rcx, %rdx		; SSE2-NEXT: cmpl $65535, %ecx # imm = 0xFFFF
; ANY-NEXT: xorl %eax, %eax		; SSE2-NEXT: setne %al
; ANY-NEXT: orq %rdi, %rdx		; SSE2-NEXT: retq
; ANY-NEXT: setne %al		;
; ANY-NEXT: retq		; AVXANY-LABEL: ne_i128_pair:
		; AVXANY: # %bb.0:
		; AVXANY-NEXT: vmovdqu (%rdi), %xmm0
		; AVXANY-NEXT: vmovdqu 16(%rdi), %xmm1
		; AVXANY-NEXT: vpcmpeqb 16(%rsi), %xmm1, %xmm1
		; AVXANY-NEXT: vpcmpeqb (%rsi), %xmm0, %xmm0
		; AVXANY-NEXT: vpand %xmm1, %xmm0, %xmm0
		; AVXANY-NEXT: vpmovmskb %xmm0, %ecx
		; AVXANY-NEXT: xorl %eax, %eax
		; AVXANY-NEXT: cmpl $65535, %ecx # imm = 0xFFFF
		; AVXANY-NEXT: setne %al
		; AVXANY-NEXT: retq
%a0 = load i128, i128* %a		%a0 = load i128, i128* %a
%b0 = load i128, i128* %b		%b0 = load i128, i128* %b
%xor1 = xor i128 %a0, %b0		%xor1 = xor i128 %a0, %b0
%ap1 = getelementptr i128, i128* %a, i128 1		%ap1 = getelementptr i128, i128* %a, i128 1
%bp1 = getelementptr i128, i128* %b, i128 1		%bp1 = getelementptr i128, i128* %b, i128 1
%a1 = load i128, i128* %ap1		%a1 = load i128, i128* %ap1
%b1 = load i128, i128* %bp1		%b1 = load i128, i128* %bp1
%xor2 = xor i128 %a1, %b1		%xor2 = xor i128 %a1, %b1
%or = or i128 %xor1, %xor2		%or = or i128 %xor1, %xor2
%cmp = icmp ne i128 %or, 0		%cmp = icmp ne i128 %or, 0
%z = zext i1 %cmp to i32		%z = zext i1 %cmp to i32
ret i32 %z		ret i32 %z
}		}

; This test models the expansion of 'memcmp(a, b, 32) == 0'		; This test models the expansion of 'memcmp(a, b, 32) == 0'
; if we allowed 2 pairs of 16-byte loads per block.		; if we allowed 2 pairs of 16-byte loads per block.

define i32 @eq_i128_pair(i128* %a, i128* %b) {		define i32 @eq_i128_pair(i128* %a, i128* %b) {
; ANY-LABEL: eq_i128_pair:		; SSE2-LABEL: eq_i128_pair:
; ANY: # %bb.0:		; SSE2: # %bb.0:
; ANY-NEXT: movq (%rdi), %rax		; SSE2-NEXT: movdqu (%rdi), %xmm0
; ANY-NEXT: movq 8(%rdi), %rcx		; SSE2-NEXT: movdqu 16(%rdi), %xmm1
; ANY-NEXT: xorq (%rsi), %rax		; SSE2-NEXT: movdqu (%rsi), %xmm2
; ANY-NEXT: xorq 8(%rsi), %rcx		; SSE2-NEXT: pcmpeqb %xmm0, %xmm2
; ANY-NEXT: movq 24(%rdi), %rdx		; SSE2-NEXT: movdqu 16(%rsi), %xmm0
; ANY-NEXT: movq 16(%rdi), %rdi		; SSE2-NEXT: pcmpeqb %xmm1, %xmm0
; ANY-NEXT: xorq 16(%rsi), %rdi		; SSE2-NEXT: pand %xmm2, %xmm0
; ANY-NEXT: orq %rax, %rdi		; SSE2-NEXT: pmovmskb %xmm0, %ecx
; ANY-NEXT: xorq 24(%rsi), %rdx		; SSE2-NEXT: xorl %eax, %eax
; ANY-NEXT: orq %rcx, %rdx		; SSE2-NEXT: cmpl $65535, %ecx # imm = 0xFFFF
; ANY-NEXT: xorl %eax, %eax		; SSE2-NEXT: sete %al
; ANY-NEXT: orq %rdi, %rdx		; SSE2-NEXT: retq
; ANY-NEXT: sete %al		;
; ANY-NEXT: retq		; AVXANY-LABEL: eq_i128_pair:
		; AVXANY: # %bb.0:
		; AVXANY-NEXT: vmovdqu (%rdi), %xmm0
		; AVXANY-NEXT: vmovdqu 16(%rdi), %xmm1
		; AVXANY-NEXT: vpcmpeqb 16(%rsi), %xmm1, %xmm1
		; AVXANY-NEXT: vpcmpeqb (%rsi), %xmm0, %xmm0
		; AVXANY-NEXT: vpand %xmm1, %xmm0, %xmm0
		; AVXANY-NEXT: vpmovmskb %xmm0, %ecx
		; AVXANY-NEXT: xorl %eax, %eax
		; AVXANY-NEXT: cmpl $65535, %ecx # imm = 0xFFFF
		; AVXANY-NEXT: sete %al
		; AVXANY-NEXT: retq
%a0 = load i128, i128* %a		%a0 = load i128, i128* %a
%b0 = load i128, i128* %b		%b0 = load i128, i128* %b
%xor1 = xor i128 %a0, %b0		%xor1 = xor i128 %a0, %b0
%ap1 = getelementptr i128, i128* %a, i128 1		%ap1 = getelementptr i128, i128* %a, i128 1
%bp1 = getelementptr i128, i128* %b, i128 1		%bp1 = getelementptr i128, i128* %b, i128 1
%a1 = load i128, i128* %ap1		%a1 = load i128, i128* %ap1
%b1 = load i128, i128* %bp1		%b1 = load i128, i128* %bp1
%xor2 = xor i128 %a1, %b1		%xor2 = xor i128 %a1, %b1
%or = or i128 %xor1, %xor2		%or = or i128 %xor1, %xor2
%cmp = icmp eq i128 %or, 0		%cmp = icmp eq i128 %or, 0
%z = zext i1 %cmp to i32		%z = zext i1 %cmp to i32
ret i32 %z		ret i32 %z
}		}

; This test models the expansion of 'memcmp(a, b, 64) != 0'		; This test models the expansion of 'memcmp(a, b, 64) != 0'
; if we allowed 2 pairs of 32-byte loads per block.		; if we allowed 2 pairs of 32-byte loads per block.

define i32 @ne_i256_pair(i256* %a, i256* %b) {		define i32 @ne_i256_pair(i256* %a, i256* %b) {
; ANY-LABEL: ne_i256_pair:		; SSE2-LABEL: ne_i256_pair:
; ANY: # %bb.0:		; SSE2: # %bb.0:
; ANY-NEXT: movq 16(%rdi), %r9		; SSE2-NEXT: movq 16(%rdi), %r9
; ANY-NEXT: movq 24(%rdi), %r11		; SSE2-NEXT: movq 24(%rdi), %r11
; ANY-NEXT: movq (%rdi), %r8		; SSE2-NEXT: movq (%rdi), %r8
; ANY-NEXT: movq 8(%rdi), %r10		; SSE2-NEXT: movq 8(%rdi), %r10
; ANY-NEXT: xorq 8(%rsi), %r10		; SSE2-NEXT: xorq 8(%rsi), %r10
; ANY-NEXT: xorq 24(%rsi), %r11		; SSE2-NEXT: xorq 24(%rsi), %r11
; ANY-NEXT: xorq (%rsi), %r8		; SSE2-NEXT: xorq (%rsi), %r8
; ANY-NEXT: xorq 16(%rsi), %r9		; SSE2-NEXT: xorq 16(%rsi), %r9
; ANY-NEXT: movq 48(%rdi), %rdx		; SSE2-NEXT: movq 48(%rdi), %rdx
; ANY-NEXT: movq 32(%rdi), %rax		; SSE2-NEXT: movq 32(%rdi), %rax
; ANY-NEXT: movq 56(%rdi), %rcx		; SSE2-NEXT: movq 56(%rdi), %rcx
; ANY-NEXT: movq 40(%rdi), %rdi		; SSE2-NEXT: movq 40(%rdi), %rdi
; ANY-NEXT: xorq 40(%rsi), %rdi		; SSE2-NEXT: xorq 40(%rsi), %rdi
; ANY-NEXT: xorq 56(%rsi), %rcx		; SSE2-NEXT: xorq 56(%rsi), %rcx
; ANY-NEXT: orq %r11, %rcx		; SSE2-NEXT: orq %r11, %rcx
; ANY-NEXT: orq %rdi, %rcx		; SSE2-NEXT: orq %rdi, %rcx
; ANY-NEXT: orq %r10, %rcx		; SSE2-NEXT: orq %r10, %rcx
; ANY-NEXT: xorq 32(%rsi), %rax		; SSE2-NEXT: xorq 32(%rsi), %rax
; ANY-NEXT: xorq 48(%rsi), %rdx		; SSE2-NEXT: xorq 48(%rsi), %rdx
; ANY-NEXT: orq %r9, %rdx		; SSE2-NEXT: orq %r9, %rdx
; ANY-NEXT: orq %rax, %rdx		; SSE2-NEXT: orq %rax, %rdx
; ANY-NEXT: orq %r8, %rdx		; SSE2-NEXT: orq %r8, %rdx
; ANY-NEXT: xorl %eax, %eax		; SSE2-NEXT: xorl %eax, %eax
; ANY-NEXT: orq %rcx, %rdx		; SSE2-NEXT: orq %rcx, %rdx
; ANY-NEXT: setne %al		; SSE2-NEXT: setne %al
; ANY-NEXT: retq		; SSE2-NEXT: retq
		;
		; AVX1-LABEL: ne_i256_pair:
		; AVX1: # %bb.0:
		; AVX1-NEXT: movq 16(%rdi), %r9
		; AVX1-NEXT: movq 24(%rdi), %r11
		; AVX1-NEXT: movq (%rdi), %r8
		; AVX1-NEXT: movq 8(%rdi), %r10
		; AVX1-NEXT: xorq 8(%rsi), %r10
		; AVX1-NEXT: xorq 24(%rsi), %r11
		; AVX1-NEXT: xorq (%rsi), %r8
		; AVX1-NEXT: xorq 16(%rsi), %r9
		; AVX1-NEXT: movq 48(%rdi), %rdx
		; AVX1-NEXT: movq 32(%rdi), %rax
		; AVX1-NEXT: movq 56(%rdi), %rcx
		; AVX1-NEXT: movq 40(%rdi), %rdi
		; AVX1-NEXT: xorq 40(%rsi), %rdi
		; AVX1-NEXT: xorq 56(%rsi), %rcx
		; AVX1-NEXT: orq %r11, %rcx
		; AVX1-NEXT: orq %rdi, %rcx
		; AVX1-NEXT: orq %r10, %rcx
		; AVX1-NEXT: xorq 32(%rsi), %rax
		; AVX1-NEXT: xorq 48(%rsi), %rdx
		; AVX1-NEXT: orq %r9, %rdx
		; AVX1-NEXT: orq %rax, %rdx
		; AVX1-NEXT: orq %r8, %rdx
		; AVX1-NEXT: xorl %eax, %eax
		; AVX1-NEXT: orq %rcx, %rdx
		; AVX1-NEXT: setne %al
		; AVX1-NEXT: retq
		;
		; AVX256-LABEL: ne_i256_pair:
		; AVX256: # %bb.0:
		; AVX256-NEXT: vmovdqu (%rdi), %ymm0
		; AVX256-NEXT: vmovdqu 32(%rdi), %ymm1
		; AVX256-NEXT: vpcmpeqb 32(%rsi), %ymm1, %ymm1
		; AVX256-NEXT: vpcmpeqb (%rsi), %ymm0, %ymm0
		; AVX256-NEXT: vpand %ymm1, %ymm0, %ymm0
		; AVX256-NEXT: vpmovmskb %ymm0, %ecx
		; AVX256-NEXT: xorl %eax, %eax
		; AVX256-NEXT: cmpl $-1, %ecx
		; AVX256-NEXT: setne %al
		; AVX256-NEXT: vzeroupper
		; AVX256-NEXT: retq
%a0 = load i256, i256* %a		%a0 = load i256, i256* %a
%b0 = load i256, i256* %b		%b0 = load i256, i256* %b
%xor1 = xor i256 %a0, %b0		%xor1 = xor i256 %a0, %b0
%ap1 = getelementptr i256, i256* %a, i256 1		%ap1 = getelementptr i256, i256* %a, i256 1
%bp1 = getelementptr i256, i256* %b, i256 1		%bp1 = getelementptr i256, i256* %b, i256 1
%a1 = load i256, i256* %ap1		%a1 = load i256, i256* %ap1
%b1 = load i256, i256* %bp1		%b1 = load i256, i256* %bp1
%xor2 = xor i256 %a1, %b1		%xor2 = xor i256 %a1, %b1
%or = or i256 %xor1, %xor2		%or = or i256 %xor1, %xor2
%cmp = icmp ne i256 %or, 0		%cmp = icmp ne i256 %or, 0
%z = zext i1 %cmp to i32		%z = zext i1 %cmp to i32
ret i32 %z		ret i32 %z
}		}

; This test models the expansion of 'memcmp(a, b, 64) == 0'		; This test models the expansion of 'memcmp(a, b, 64) == 0'
; if we allowed 2 pairs of 32-byte loads per block.		; if we allowed 2 pairs of 32-byte loads per block.

define i32 @eq_i256_pair(i256* %a, i256* %b) {		define i32 @eq_i256_pair(i256* %a, i256* %b) {
; ANY-LABEL: eq_i256_pair:		; SSE2-LABEL: eq_i256_pair:
; ANY: # %bb.0:		; SSE2: # %bb.0:
; ANY-NEXT: movq 16(%rdi), %r9		; SSE2-NEXT: movq 16(%rdi), %r9
; ANY-NEXT: movq 24(%rdi), %r11		; SSE2-NEXT: movq 24(%rdi), %r11
; ANY-NEXT: movq (%rdi), %r8		; SSE2-NEXT: movq (%rdi), %r8
; ANY-NEXT: movq 8(%rdi), %r10		; SSE2-NEXT: movq 8(%rdi), %r10
; ANY-NEXT: xorq 8(%rsi), %r10		; SSE2-NEXT: xorq 8(%rsi), %r10
; ANY-NEXT: xorq 24(%rsi), %r11		; SSE2-NEXT: xorq 24(%rsi), %r11
; ANY-NEXT: xorq (%rsi), %r8		; SSE2-NEXT: xorq (%rsi), %r8
; ANY-NEXT: xorq 16(%rsi), %r9		; SSE2-NEXT: xorq 16(%rsi), %r9
; ANY-NEXT: movq 48(%rdi), %rdx		; SSE2-NEXT: movq 48(%rdi), %rdx
; ANY-NEXT: movq 32(%rdi), %rax		; SSE2-NEXT: movq 32(%rdi), %rax
; ANY-NEXT: movq 56(%rdi), %rcx		; SSE2-NEXT: movq 56(%rdi), %rcx
; ANY-NEXT: movq 40(%rdi), %rdi		; SSE2-NEXT: movq 40(%rdi), %rdi
; ANY-NEXT: xorq 40(%rsi), %rdi		; SSE2-NEXT: xorq 40(%rsi), %rdi
; ANY-NEXT: xorq 56(%rsi), %rcx		; SSE2-NEXT: xorq 56(%rsi), %rcx
; ANY-NEXT: orq %r11, %rcx		; SSE2-NEXT: orq %r11, %rcx
; ANY-NEXT: orq %rdi, %rcx		; SSE2-NEXT: orq %rdi, %rcx
; ANY-NEXT: orq %r10, %rcx		; SSE2-NEXT: orq %r10, %rcx
; ANY-NEXT: xorq 32(%rsi), %rax		; SSE2-NEXT: xorq 32(%rsi), %rax
; ANY-NEXT: xorq 48(%rsi), %rdx		; SSE2-NEXT: xorq 48(%rsi), %rdx
; ANY-NEXT: orq %r9, %rdx		; SSE2-NEXT: orq %r9, %rdx
; ANY-NEXT: orq %rax, %rdx		; SSE2-NEXT: orq %rax, %rdx
; ANY-NEXT: orq %r8, %rdx		; SSE2-NEXT: orq %r8, %rdx
; ANY-NEXT: xorl %eax, %eax		; SSE2-NEXT: xorl %eax, %eax
; ANY-NEXT: orq %rcx, %rdx		; SSE2-NEXT: orq %rcx, %rdx
; ANY-NEXT: sete %al		; SSE2-NEXT: sete %al
; ANY-NEXT: retq		; SSE2-NEXT: retq
		;
		; AVX1-LABEL: eq_i256_pair:
		; AVX1: # %bb.0:
		; AVX1-NEXT: movq 16(%rdi), %r9
		; AVX1-NEXT: movq 24(%rdi), %r11
		; AVX1-NEXT: movq (%rdi), %r8
		; AVX1-NEXT: movq 8(%rdi), %r10
		; AVX1-NEXT: xorq 8(%rsi), %r10
		; AVX1-NEXT: xorq 24(%rsi), %r11
		; AVX1-NEXT: xorq (%rsi), %r8
		; AVX1-NEXT: xorq 16(%rsi), %r9
		; AVX1-NEXT: movq 48(%rdi), %rdx
		; AVX1-NEXT: movq 32(%rdi), %rax
		; AVX1-NEXT: movq 56(%rdi), %rcx
		; AVX1-NEXT: movq 40(%rdi), %rdi
		; AVX1-NEXT: xorq 40(%rsi), %rdi
		; AVX1-NEXT: xorq 56(%rsi), %rcx
		; AVX1-NEXT: orq %r11, %rcx
		; AVX1-NEXT: orq %rdi, %rcx
		; AVX1-NEXT: orq %r10, %rcx
		; AVX1-NEXT: xorq 32(%rsi), %rax
		; AVX1-NEXT: xorq 48(%rsi), %rdx
		; AVX1-NEXT: orq %r9, %rdx
		; AVX1-NEXT: orq %rax, %rdx
		; AVX1-NEXT: orq %r8, %rdx
		; AVX1-NEXT: xorl %eax, %eax
		; AVX1-NEXT: orq %rcx, %rdx
		; AVX1-NEXT: sete %al
		; AVX1-NEXT: retq
		;
		; AVX256-LABEL: eq_i256_pair:
		; AVX256: # %bb.0:
		; AVX256-NEXT: vmovdqu (%rdi), %ymm0
		; AVX256-NEXT: vmovdqu 32(%rdi), %ymm1
		; AVX256-NEXT: vpcmpeqb 32(%rsi), %ymm1, %ymm1
		; AVX256-NEXT: vpcmpeqb (%rsi), %ymm0, %ymm0
		; AVX256-NEXT: vpand %ymm1, %ymm0, %ymm0
		; AVX256-NEXT: vpmovmskb %ymm0, %ecx
		; AVX256-NEXT: xorl %eax, %eax
		; AVX256-NEXT: cmpl $-1, %ecx
		; AVX256-NEXT: sete %al
		; AVX256-NEXT: vzeroupper
		; AVX256-NEXT: retq
%a0 = load i256, i256* %a		%a0 = load i256, i256* %a
%b0 = load i256, i256* %b		%b0 = load i256, i256* %b
%xor1 = xor i256 %a0, %b0		%xor1 = xor i256 %a0, %b0
%ap1 = getelementptr i256, i256* %a, i256 1		%ap1 = getelementptr i256, i256* %a, i256 1
%bp1 = getelementptr i256, i256* %b, i256 1		%bp1 = getelementptr i256, i256* %b, i256 1
%a1 = load i256, i256* %ap1		%a1 = load i256, i256* %ap1
%b1 = load i256, i256* %bp1		%b1 = load i256, i256* %bp1
%xor2 = xor i256 %a1, %b1		%xor2 = xor i256 %a1, %b1
%or = or i256 %xor1, %xor2		%or = or i256 %xor1, %xor2
%cmp = icmp eq i256 %or, 0		%cmp = icmp eq i256 %or, 0
%z = zext i1 %cmp to i32		%z = zext i1 %cmp to i32
ret i32 %z		ret i32 %z
}		}