This is an archive of the discontinued LLVM Phabricator instance.

[LibCallSimplifier] try harder to fold memcmp with constant arguments
ClosedPublic

Authored by spatel on Aug 19 2017, 8:20 AM.

Details

Summary

Try to fold:
memcmp(X, C, ConstantLength) == 0 --> load X == *C

Without this change, we're unnecessarily checking the alignment of the constant data, so we miss the transform in the first 2 tests in the patch.
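
For illustration, here is the kind of source pattern this fold targets (a hypothetical example, not one of the tests in the patch). With a constant string and a constant length, the whole call can become a single load of the non-constant pointer compared against the known constant bytes, and the alignment of the constant data never matters because its bytes are known at compile time:

  // Hypothetical C++ example of a foldable call: constant RHS, constant length.
  // After the transform, this is a 2-byte load from p compared against the
  // bytes of "ab"; no call to memcmp remains.
  #include <cstring>

  bool isAB(const char *p) {
    return std::memcmp(p, "ab", 2) == 0;
  }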

I noted this shortcoming of LibCallSimplifier in one of the recent CGP memcmp expansion patches. This doesn't help the example in:
https://bugs.llvm.org/show_bug.cgi?id=34032#c13
...directly, but I think it's worth short-circuiting more of these simple cases since we're already trying to do that.

Diff Detail

Repository
rL LLVM

Event Timeline

spatel created this revision. Aug 19 2017, 8:20 AM
spatel edited the summary of this revision. Aug 19 2017, 8:27 AM
davide edited edge metadata. Aug 19 2017, 8:37 AM

What does GCC do in this case? (or icc?)

https://godbolt.org/g/oKf3s8

So gcc gets it; icc did something terrible; clang also gets this, but in the x86 backend. This moves the optimization up in the pipeline so it can hopefully combine with the common-prefix optimization suggested by Joerg in D35035.

Do you have an example of a common prefix opt that fires if we perform this lowering as part of the instruction combiner?

No. AFAIK, no such common prefix optimization exists in llvm.

So maybe we should consider postponing this change until a real use case shows up?
This doesn't seem like a lot of code to add when/if a real opportunity shows up.

I'm not sure why that's better. If you're advocating that we should remove early memcmp transforms entirely, then the burden of proof for that change is much higher as I understand it. But this patch is just trying to fix a logic hole in the existing transform.

The motivation for fixing this now is that there are cases where earlier analysis/expansion of small memcmp would be a perf win, so I'd like to remove this roadblock from affecting work on that going forward.

A potential example of that is the attachment in PR34032 (that's the optimized IR from an llvm source file - lib/IR/Function.cpp). The IR has ~5500 memcmp calls in it, and even in its current limited form for x86, CGP memcmp expansion will transform over 4K of those calls to inline IR (i.e., the majority of memcmps are constant length and less than 16 bytes). If you then run that IR through the normal opt -O2 pipeline, it gets significantly smaller, so I'm taking that as early evidence that there's room for improvement.

While I understand your motivation, I don't see the immediate benefits. I'm happy to have this patch in InstCombine if you can show a case we can't catch already.

I don't understand your motivation in delaying/blocking this patch. Are you opposed to the existing transform?

The immediate benefit is shown in the test cases in this patch: we're replacing a libcall + cmp with a load + cmp. The transform exposes further IR-level optimizations because we can and do reason more effectively about a load + cmp than a memcmp libcall. If it's not clear, I can check in the test cases with their current behavior as the baseline, so we just have the test diffs here?

majnemer edited edge metadata. Aug 19 2017, 3:59 PM

Canonicalizing away from memcmp seems pretty good to me. An easy way (I think) to show benefit is that multiple memcmps involving the same non-constant operand will CSE the loads, which makes the result considerably more analyzable.

That's correct, and sorry if it wasn't clear, but let me link directly to https://bugs.llvm.org/show_bug.cgi?id=34032#c13 which has this manufactured example:
https://godbolt.org/g/QjVXvS

Does that provide the required motivation/benefit?

In my eyes, yes.
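
To make the CSE point concrete, here is a minimal hypothetical sketch (not the manufactured example from the bug report): once both calls are folded to load + compare, the two loads read the same bytes from p, so they can be combined and the branches become easy for later passes to analyze.

  // Hypothetical sketch: two constant-length memcmps against the same pointer.
  // After folding, both become 2-byte loads from p compared against constants;
  // the identical loads can be CSE'd and the result reasoned about directly.
  #include <cstring>

  int classify(const char *p) {
    if (std::memcmp(p, "ab", 2) == 0)
      return 1;
    if (std::memcmp(p, "cd", 2) == 0)
      return 2;
    return 0;
  }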

davide accepted this revision. Aug 21 2017, 5:27 AM

Oh, I missed the CSE case. LGTM.

This revision is now accepted and ready to land. Aug 21 2017, 5:27 AM
This revision was automatically updated to reflect the committed changes.
efriedma added inline comments. Aug 23 2017, 1:03 PM
llvm/trunk/lib/Transforms/Utils/SimplifyLibCalls.cpp
778

Are you sure this is right? It looks like you'll create a dead load if the LHS is aligned, but the RHS isn't (and therefore drive instcombine into an infinite loop).

spatel added inline comments. Aug 23 2017, 1:10 PM
llvm/trunk/lib/Transforms/Utils/SimplifyLibCalls.cpp
778

No - I'm sure this is wrong in exactly the way you've noted. :)
Sorry the review didn't get updated here in Phab, but I reverted this version of the patch at rL311340 and recommitted with a fix and extra test case at rL311366. Please let me know if you see any problems there.

efriedma edited edge metadata. Aug 23 2017, 1:15 PM

Oh, sorry, I'm way behind on my email and didn't see it. Yes, that looks fine.