
Details

Reviewers

RKSimon
zvi
craig.topper
efriedma

Commits

rG9ebb68843e57: [x86] use PMOVMSK to replace memcmp libcalls for 16-byte equality
rL298775: [x86] use PMOVMSK to replace memcmp libcalls for 16-byte equality

Summary

This is the payoff for D31156 - if a target has efficient comparison instructions for vector-sized equality, we can replace memcmp calls with inline code that is both smaller and faster.
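As a rough illustration of the inline code this change aims for, here is a C++ sketch of a 16-byte inequality test written with SSE2 intrinsics. It is not taken from the patch: the name `differ16` is made up, and the `memcmp` fallback for builds without SSE2 is an assumption added so the sketch stays portable.

```cpp
#include <cstring>
#if defined(__SSE2__)
#include <emmintrin.h> // SSE2 intrinsics
#endif

// Hypothetical helper: returns true when the 16 bytes at a and b differ,
// i.e. the same answer as memcmp(a, b, 16) != 0.
bool differ16(const void *a, const void *b) {
#if defined(__SSE2__)
  // Unaligned 16-byte loads (movdqu).
  __m128i va = _mm_loadu_si128(static_cast<const __m128i *>(a));
  __m128i vb = _mm_loadu_si128(static_cast<const __m128i *>(b));
  // Byte-wise equality: every equal byte becomes 0xFF (pcmpeqb).
  __m128i eq = _mm_cmpeq_epi8(va, vb);
  // Gather the top bit of each byte into a 16-bit mask (pmovmskb).
  // All 16 bits set means all 16 bytes matched.
  return _mm_movemask_epi8(eq) != 0xFFFF;
#else
  // Assumed fallback when SSE2 is not available at compile time.
  return std::memcmp(a, b, 16) != 0;
#endif
}
```

The comparison against `0xFFFF` plays the same role as the `cmpl $65535, %eax` in the updated test checks.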

Seems like we're missing a load folding opportunity on the first test, but that's a separate problem.

I can enable the 32-byte case for AVX2 as an immediate follow-up, but I want to make sure this part looks ok before adding that.

Diff Detail

Event Timeline

spatel created this revision. Mar 23 2017, 9:17 AM

Herald added a subscriber: mcrosier. Mar 23 2017, 9:17 AM

efriedma added a subscriber: efriedma. Mar 23 2017, 12:40 PM

efriedma added inline comments.

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
6116	What's the point of performing the load in a vector type if you're going to immediately bitcast the result to an integer type? IIRC DAGCombine will fold this away.
test/CodeGen/X86/memcmp.ll
104	What's the performance of this compared to using integer registers? (movq+xorq+movq+xorq+orq).

spatel added inline comments. Mar 23 2017, 1:40 PM

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
6116	I actually had it loading i128 to start, but I saw 2 problems:
1. The i128 loads+bitcasts weren't converted to vector loads directly. Legalization for x86-64 split this into i64 loads, and then we had to rely on the combiner to merge the loads. At the least, I think this would be slower to compile since it caused more nodes to be created and folded. At worst, we might not put the loads back together properly and that would lead to poor code. It wasn't clear to me that I could add a generic combine to do that either since some targets might not want that.
2. It wasn't honest to use i128 loads and bypass the isTypeLegal() check. We could make the TLI hook more specialized to account for that - have it confirm that loads of a given type/size are fast, so it's truly just a memcmp hook. But given the first problem, I got scared away.
test/CodeGen/X86/memcmp.ll
104	Hmm...didn't consider that option since movmsk has been fast for a long time and scalar always needs more ops. We'd need to separate x86-32 from x86-64 too. I'll try to get some real numbers.

spatel added inline comments. Mar 23 2017, 2:32 PM

test/CodeGen/X86/memcmp.ll
104	I benchmarked the 2 sequences shown below and the libcall. On Haswell with macOS, I'm seeing more wobble in these numbers than I can explain, but:

memcmp  : 34485936 cycles for 1048576 iterations (32.89 cycles/iter).
vec cmp : 5245888 cycles for 1048576 iterations (5.00 cycles/iter).
xor cmp : 5247940 cycles for 1048576 iterations (5.00 cycles/iter).

On Ubuntu with AMD Jaguar:

memcmp  : 21150343 cycles for 1048576 iterations (20.17 cycles/iter).
vec cmp : 9988395 cycles for 1048576 iterations (9.53 cycles/iter).
xor cmp : 9471849 cycles for 1048576 iterations (9.03 cycles/iter).

    .align 6, 0x90
    .global _cmp16vec
_cmp16vec:
    movdqu (%rsi), %xmm0
    movdqu (%rdi), %xmm1
    pcmpeqb %xmm0, %xmm1
    pmovmskb %xmm1, %eax
    cmpl $65535, %eax
    setne %al
    movzbl %al, %eax
    retq

    .align 6, 0x90
    .global _cmp16scalar
_cmp16scalar:
    movq (%rsi), %rax
    movq 8(%rsi), %rcx
    xorq (%rdi), %rax
    xorq 8(%rdi), %rcx
    orq %rax, %rcx
    setne %al
    movzbl %al, %eax
    retq
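For readers who prefer C++ to AT&T assembly, the scalar `_cmp16scalar` sequence can be approximated as follows. This is only a sketch: `differ16Scalar` is a made-up name, and `std::memcpy` stands in for the unaligned 8-byte loads, on the assumption that the compiler lowers those fixed-size copies to plain moves.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper mirroring the xor/or sequence: true when the 16 bytes
// at a and b differ.
bool differ16Scalar(const void *a, const void *b) {
  const unsigned char *pa = static_cast<const unsigned char *>(a);
  const unsigned char *pb = static_cast<const unsigned char *>(b);
  std::uint64_t a0, a1, b0, b1;
  // Two unaligned 8-byte loads per operand (the movq/xorq memory operands).
  std::memcpy(&a0, pa, 8);
  std::memcpy(&a1, pa + 8, 8);
  std::memcpy(&b0, pb, 8);
  std::memcpy(&b1, pb + 8, 8);
  // xor leaves a nonzero value wherever the halves differ; or combines them.
  return ((a0 ^ b0) | (a1 ^ b1)) != 0;
}
```

Unlike the vector version, this form needs no SSE2 at all, which is relevant to the soft-float concern raised later in the review.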

spatel added inline comments. Mar 23 2017, 3:36 PM

test/CodeGen/X86/memcmp.ll
104	There will be bugs: https://bugs.llvm.org/show_bug.cgi?id=32401

(clang produces the xor sequence if you just write int x(__int128_t*x, __int128_t*y) { return *x == *y; }.)

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
6116	If DAGCombine doesn't fold it away, this is fine, I guess. Maybe let the target specify the type to use, in case some target wants to use a type that isn't `<4 x i32>`?

spatel added inline comments. Mar 24 2017, 9:56 AM

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
6116	Yes - that would be better. We can cycle through the possible simple types (including i128), and the target can let us know what works. Also, your example using "__int128_t" probably explains why we saw/expected different things after this step in the DAG. If the loads are aligned, then we will legalize these to v16i8 loads for an SSE2 target, but not if they are unaligned as I was seeing in my experiments.

Patch updated:
Check all of the 16-byte simple value types before giving up.
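The "try each 16-byte type until one works" idea can be modeled outside LLVM roughly like this. Everything here is a toy: `ToyVT` is not the LLVM `MVT` enum, and `pick16ByteLoadType` and its callback are hypothetical names standing in for the builder-side loop and the target query.

```cpp
#include <functional>
#include <initializer_list>

// Toy stand-in for a handful of 16-byte value types; not the LLVM enum.
enum class ToyVT { Invalid, i128, v16i8, v8i16, v4i32, v2i64 };

// Hypothetical helper: return the first 16-byte candidate the target accepts,
// or ToyVT::Invalid if none is acceptable (the "giving up" case).
ToyVT pick16ByteLoadType(const std::function<bool(ToyVT)> &targetAccepts) {
  for (ToyVT vt : {ToyVT::i128, ToyVT::v16i8, ToyVT::v8i16, ToyVT::v4i32,
                   ToyVT::v2i64}) {
    if (targetAccepts(vt))
      return vt;
  }
  return ToyVT::Invalid;
}
```

The candidate order matters in this model: the first accepted type wins, so a target that accepts several vector types gets whichever is listed earliest.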

Eli pointed me to D28637 (which I hadn't seen of course!) - a general solution for memcmp transformation. Not sure if this specialization still makes sense given that patch, but since I already made the edits, I'll post it.

> Not sure if this specialization still makes sense given that patch, but since I already made the edits, I'll post it.

Even with that patch, we probably still want a similar target hook. Might as well finish/merge this now, then make sure we continue to generate the same efficient code when x86 transitions to the new memcmp lowering.

lib/Target/X86/X86ISelLowering.h
819	It probably makes sense to make this take a size in bytes, and return a VT, rather than calling this with every possible VT.

spatel marked an inline comment as done. Mar 24 2017, 1:42 PM

spatel added inline comments.

lib/Target/X86/X86ISelLowering.h
819	Yep - that makes the patch simpler.

Patch updated:
Have the TLI hook return the preferred operand (load) type for a given bitwidth, so we don't have to cycle through all of those when transforming the memcmp().

I'm using EVT instead of MVT in the hook anticipating that we extend this to 256-bit types for AVX2. In that case, we'd use i256 which isn't an MVT / simple type, so we'd have to switch it at that point unless I'm misunderstanding how these things work.

Patch updated:
On 2nd thought, that EVT/MVT argument makes no sense. The returned type from the hook is always going to be an MVT because it will be a supported type in order to be fast. Using MVT makes the code a bit cleaner since we don't have to pass a context around for those.
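A size-based hook of the kind discussed here might look like the following toy model, in which the caller passes a bit width and gets back the type to load and compare. This is a sketch under stated assumptions, not the patch's code: `ToyTarget`, its fields, and `preferredCompareType` are hypothetical, and types are represented as strings purely for readability.

```cpp
#include <string>

// Toy description of a target; both fields are assumptions for illustration.
struct ToyTarget {
  unsigned widestLegalIntBits; // e.g. 64 on x86-64, 32 on 32-bit x86
  bool v16i8IsLegal;           // true when 128-bit vector registers are usable
};

// Hypothetical size-based hook: name the type to use for an equality test of
// numBits bits, or return "" when nothing fast is available.
std::string preferredCompareType(const ToyTarget &target, unsigned numBits) {
  bool isScalarWidth =
      numBits == 8 || numBits == 16 || numBits == 32 || numBits == 64;
  // A legal scalar integer of the requested width is always preferred.
  if (isScalarWidth && numBits <= target.widestLegalIntBits)
    return "i" + std::to_string(numBits);
  // Otherwise only the 128-bit case has a vector answer in this model.
  if (numBits == 128 && target.v16i8IsLegal)
    return "v16i8";
  return "";
}
```

Because the answer is always a type the target already supports, a simple (non-extended) type is enough for the return value, which is the point of the MVT remark above.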

efriedma added inline comments. Mar 24 2017, 2:44 PM

lib/Target/X86/X86ISelLowering.cpp
4646	Maybe check isTypeLegal(MVT::v16i8) instead? hasSSE2() doesn't mean what you want it to.
4650	Maybe also 64-bit types (on a 32-bit target).
test/CodeGen/X86/memcmp.ll
2	Could you regenerate this test so it also compiles for a 32-bit target?

Patch updated:

Added 32-bit target testing in rL298744
Don't use hasSSE2() in the x86 override - that won't work if we're in soft-float mode (nice catch!).
Add TODO comment to handle 64-bit type on x86 32-bit target.

LGTM.

This revision is now accepted and ready to land. Mar 24 2017, 4:33 PM

Closed by commit rL298775: [x86] use PMOVMSK to replace memcmp libcalls for 16-byte equality (authored by spatel). Mar 25 2017, 9:17 AM

This revision was automatically updated to reflect the committed changes.

Diff 92974

include/llvm/Target/TargetLowering.h

[431 lines not shown. Hunk context: public:]
   /// into a single machine instruction of a form like:
   /// \code
   ///   cc = test %register, #mask
   /// \endcode
   virtual bool isMaskAndCmp0FoldingBeneficial(const Instruction &AndI) const {
     return false;
   }

+  /// Return true if the target has a quick way to compare values of the given
+  /// type. By default, assume that any legal type can be compared efficiently.
+  virtual bool hasFastEqualityCompare(EVT VT) const {
+    return isTypeLegal(VT);
+  }
+
   /// Return true if the target should transform:
   /// (X & Y) == Y ---> (~X & Y) == 0
   /// (X & Y) != Y ---> (~X & Y) != 0
   ///
   /// This may be profitable if the target has a bitwise and-not operation that
   /// sets comparison flags. A target may want to limit the transformation based
   /// on the type of Y or if Y is a constant.
   ///
[2,777 lines not shown]

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp

[5,949 lines not shown. Hunk context: if (const ICmpInst *IC = dyn_cast<ICmpInst>(U))]
             continue;
     // Unknown instruction.
     return false;
   }
   return true;
 }

 static SDValue getMemCmpLoad(const Value *PtrVal, MVT LoadVT,
-                             Type *LoadTy,
                              SelectionDAGBuilder &Builder) {

   // Check to see if this load can be trivially constant folded, e.g. if the
   // input is from a string literal.
   if (const Constant *LoadInput = dyn_cast<Constant>(PtrVal)) {
     // Cast pointer to the type we really want to load.
+    Type *LoadTy =
+        Type::getIntNTy(PtrVal->getContext(), LoadVT.getScalarSizeInBits());
+    if (LoadVT.isVector())
+      LoadTy = VectorType::get(LoadTy, LoadVT.getVectorNumElements());
+
     LoadInput = ConstantExpr::getBitCast(const_cast<Constant *>(LoadInput),
                                          PointerType::getUnqual(LoadTy));

     if (const Constant *LoadCst = ConstantFoldLoadFromConstPtr(
             const_cast<Constant *>(LoadInput), LoadTy, Builder.DL))
       return Builder.getValue(LoadCst);
   }

[61 lines not shown. Hunk context: if (Res.first.getNode()) {]
     return true;
   }

   // memcmp(S1,S2,2) != 0 -> ((short)LHS != (short)RHS) != 0
   // memcmp(S1,S2,4) != 0 -> ((int)LHS != (int)RHS) != 0
   if (!CSize || !IsOnlyUsedInZeroEqualityComparison(&I))
     return false;

-  MVT LoadVT;
-  Type *LoadTy;
+  // Require that the load VT is legal and that the target supports unaligned
+  // loads of that type. If the load VT is good, check that a scalar compare of
+  // the load size is fast and return that type. Otherwise, return INVALID.
+  auto hasFastLoadsAndCompare = [&](MVT LoadVT) {
+    // TODO: Handle 5 byte compare as 4-byte + 1 byte.
+    // TODO: Handle 8 byte compare on x86-32 as two 32-bit loads.
+    // TODO: Check alignment of src and dest ptrs.
+
+    unsigned DstAS = LHS->getType()->getPointerAddressSpace();
+    unsigned SrcAS = RHS->getType()->getPointerAddressSpace();
+    const TargetLowering &TLI = DAG.getTargetLoweringInfo();
+    if (!TLI.isTypeLegal(LoadVT) ||
+        !TLI.allowsMisalignedMemoryAccesses(LoadVT, SrcAS) ||
+        !TLI.allowsMisalignedMemoryAccesses(LoadVT, DstAS))
+      return MVT::INVALID_SIMPLE_VALUE_TYPE;
+
+    // For a vector type, we need to do a scalar comparison of the whole vector.
+    MVT CmpVT = LoadVT.isVector() ? LoadVT.getIntegerVT(LoadVT.getSizeInBits())
+                                  : LoadVT;
+    if (!TLI.hasFastEqualityCompare(CmpVT))
+      return MVT::INVALID_SIMPLE_VALUE_TYPE;
+
+    return CmpVT.SimpleTy;
+  };
+
+  // This turns into unaligned loads. We only do this if the target natively
+  // supports the MVT we'll be loading or if it is small enough (<= 4) that
+  // we'll only produce a small number of byte loads.
+  MVT LoadVT = MVT::INVALID_SIMPLE_VALUE_TYPE;
+  MVT CmpVT = MVT::INVALID_SIMPLE_VALUE_TYPE;
   switch (CSize->getZExtValue()) {
   default:
-    return false;
+    break;
   case 2:
-    LoadVT = MVT::i16;
-    LoadTy = Type::getInt16Ty(CSize->getContext());
+    LoadVT = CmpVT = MVT::i16;
     break;
   case 4:
-    LoadVT = MVT::i32;
-    LoadTy = Type::getInt32Ty(CSize->getContext());
+    LoadVT = CmpVT = MVT::i32;
     break;
   case 8:
     LoadVT = MVT::i64;
-    LoadTy = Type::getInt64Ty(CSize->getContext());
+    CmpVT = hasFastLoadsAndCompare(LoadVT);
     break;
-  /*
   case 16:
-    LoadVT = MVT::v4i32;
-    LoadTy = Type::getInt32Ty(CSize->getContext());
-    LoadTy = VectorType::get(LoadTy, 4);
+    // Find a 16-byte load type that is fast for this target.
+    for (MVT VT16Bytes :
+         {MVT::i128, MVT::v16i8, MVT::v8i16, MVT::v4i32, MVT::v2i64}) {
+      CmpVT = hasFastLoadsAndCompare(VT16Bytes);
+      if (CmpVT != MVT::INVALID_SIMPLE_VALUE_TYPE) {
+        LoadVT = VT16Bytes;
+        break;
+      }
+    }
     break;
-  */
   }

-  // This turns into unaligned loads. We only do this if the target natively
-  // supports the MVT we'll be loading or if it is small enough (<= 4) that
-  // we'll only produce a small number of byte loads.
-
-  // Require that we can find a legal MVT, and only do this if the target
-  // supports unaligned loads of that type. Expanding into byte loads would
-  // bloat the code.
-  const TargetLowering &TLI = DAG.getTargetLoweringInfo();
-  if (CSize->getZExtValue() > 4) {
-    unsigned DstAS = LHS->getType()->getPointerAddressSpace();
-    unsigned SrcAS = RHS->getType()->getPointerAddressSpace();
-    // TODO: Handle 5 byte compare as 4-byte + 1 byte.
-    // TODO: Handle 8 byte compare on x86-32 as two 32-bit loads.
-    // TODO: Check alignment of src and dest ptrs.
-    if (!TLI.isTypeLegal(LoadVT) ||
-        !TLI.allowsMisalignedMemoryAccesses(LoadVT, SrcAS) ||
-        !TLI.allowsMisalignedMemoryAccesses(LoadVT, DstAS))
-      return false;
-  }
+  if (LoadVT == MVT::INVALID_SIMPLE_VALUE_TYPE ||
+      CmpVT == MVT::INVALID_SIMPLE_VALUE_TYPE)
+    return false;

-  SDValue LHSVal = getMemCmpLoad(LHS, LoadVT, LoadTy, *this);
-  SDValue RHSVal = getMemCmpLoad(RHS, LoadVT, LoadTy, *this);
+  SDValue LoadL = getMemCmpLoad(LHS, LoadVT, *this);
+  SDValue LoadR = getMemCmpLoad(RHS, LoadVT, *this);
+
+  // Bitcast to integer type if the loads are vectors.
+  LoadL = DAG.getBitcast(CmpVT, LoadL);
+  LoadR = DAG.getBitcast(CmpVT, LoadR);

   SDValue SetCC =
-      DAG.getSetCC(getCurSDLoc(), MVT::i1, LHSVal, RHSVal, ISD::SETNE);
+      DAG.getSetCC(getCurSDLoc(), MVT::i1, LoadL, LoadR, ISD::SETNE);
   processIntegerCallValue(I, SetCC, false);
   return true;
 }

 /// See if we can lower a memchr call into an optimized form. If so, return
 /// true and lower it. Otherwise return false, and it will be lowered like a
 /// normal call.
 /// The caller already checked that \p I calls the appropriate LibFunc with a
 /// correct prototype.
 bool SelectionDAGBuilder::visitMemChrCall(const CallInst &I) {
[3,433 lines not shown]

lib/Target/X86/X86ISelLowering.h

[809 lines not shown. Hunk context: bool isMultiStoresCheaperThanBitsMerge(EVT LTy, EVT HTy) const override {]
       // such pair out until we get testcase to prove it is a win.
       return false;
     }

     bool isMaskAndCmp0FoldingBeneficial(const Instruction &AndI) const override;

     bool hasAndNotCompare(SDValue Y) const override;

+    /// Vector-sized comparisons are fast using PCMPEQ + PMOVMSK or PTEST.
+    bool hasFastEqualityCompare(EVT VT) const override;
+
     /// Return the value type to use for ISD::SETCC.
     EVT getSetCCResultType(const DataLayout &DL, LLVMContext &Context,
                            EVT VT) const override;

     /// Determine which of the bits specified in Mask are known to be either
     /// zero or one and return them in the KnownZero/KnownOne bitsets.
     void computeKnownBitsForTargetNode(const SDValue Op,
                                        APInt &KnownZero,
[568 lines not shown]

lib/Target/X86/X86ISelLowering.cpp

[4,631 lines not shown. Hunk context: bool X86TargetLowering::hasAndNotCompare(SDValue Y) const {]
   // There are only 32-bit and 64-bit forms for 'andn'.
   EVT VT = Y.getValueType();
   if (VT != MVT::i32 && VT != MVT::i64)
     return false;

   return true;
 }

+bool X86TargetLowering::hasFastEqualityCompare(EVT VT) const {
+  // TODO: 256- and 512-bit types should be allowed, but make sure that those
+  // cases are handled in combineVectorSizedSetCCEquality().
+  return isTypeLegal(VT) || (Subtarget.hasSSE2() && VT == MVT::i128);
+}
+
 /// Val is the undef sentinel value or equal to the specified value.
 static bool isUndefOrEqual(int Val, int CmpVal) {
   return ((Val == SM_SentinelUndef) || (Val == CmpVal));
 }

 /// Val is either the undef or zero sentinel value.
 static bool isUndefOrZero(int Val) {
   return ((Val == SM_SentinelUndef) || (Val == SM_SentinelZero));
 }

 /// Return true if every element in Mask, beginning
 /// from position Pos and ending in Pos+Size is the undef sentinel value.
 static bool isUndefInRange(ArrayRef<int> Mask, unsigned Pos, unsigned Size) {
[31,207 lines not shown]

test/CodeGen/X86/memcmp.ll

 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
 ; RUN: llc < %s -mtriple=x86_64-unknown-unknown | FileCheck %s

 ; This tests codegen time inlining/optimization of memcmp
 ; rdar://6480398

 @.str = private constant [65 x i8] c"0123456789012345678901234567890123456789012345678901234567890123\00", align 1

 declare i32 @memcmp(i8*, i8*, i64)

[81 lines not shown. Hunk context: ; CHECK-NEXT: retq]
   %m = tail call i32 @memcmp(i8* %X, i8* getelementptr inbounds ([65 x i8], [65 x i8]* @.str, i32 0, i32 0), i64 8) nounwind
   %c = icmp ne i32 %m, 0
   ret i1 %c
 }

 define i1 @length16(i8* %x, i8* %y) nounwind {
 ; CHECK-LABEL: length16:
 ; CHECK: # BB#0:
-; CHECK-NEXT: pushq %rax
-; CHECK-NEXT: movl $16, %edx
-; CHECK-NEXT: callq memcmp
-; CHECK-NEXT: testl %eax, %eax
+; CHECK-NEXT: movdqu (%rsi), %xmm0
+; CHECK-NEXT: movdqu (%rdi), %xmm1
+; CHECK-NEXT: pcmpeqb %xmm0, %xmm1
+; CHECK-NEXT: pmovmskb %xmm1, %eax
+; CHECK-NEXT: cmpl $65535, %eax # imm = 0xFFFF
 ; CHECK-NEXT: setne %al
-; CHECK-NEXT: popq %rcx
 ; CHECK-NEXT: retq
   %call = tail call i32 @memcmp(i8* %x, i8* %y, i64 16) nounwind
   %cmp = icmp ne i32 %call, 0
   ret i1 %cmp
 }

 define i1 @length16_const(i8* %X, i32* nocapture %P) nounwind {
 ; CHECK-LABEL: length16_const:
 ; CHECK: # BB#0:
-; CHECK-NEXT: pushq %rax
-; CHECK-NEXT: movl $.L.str, %esi
-; CHECK-NEXT: movl $16, %edx
-; CHECK-NEXT: callq memcmp
-; CHECK-NEXT: testl %eax, %eax
+; CHECK-NEXT: movdqu (%rdi), %xmm0
+; CHECK-NEXT: pcmpeqb {{.*}}(%rip), %xmm0
+; CHECK-NEXT: pmovmskb %xmm0, %eax
+; CHECK-NEXT: cmpl $65535, %eax # imm = 0xFFFF
 ; CHECK-NEXT: sete %al
-; CHECK-NEXT: popq %rcx
 ; CHECK-NEXT: retq
   %m = tail call i32 @memcmp(i8* %X, i8* getelementptr inbounds ([65 x i8], [65 x i8]* @.str, i32 0, i32 0), i64 16) nounwind
   %c = icmp eq i32 %m, 0
   ret i1 %c
 }

 define i1 @length32(i8* %x, i8* %y) nounwind {
 ; CHECK-LABEL: length32:
[60 lines not shown]

This is an archive of the discontinued LLVM Phabricator instance.

[x86] use PMOVMSK to replace memcmp libcalls for 16-byte equality
Closed, Public
