This is an archive of the discontinued LLVM Phabricator instance.

Remove sometimes faulty rewrite of memcpy in instcombine.
ClosedPublic

Authored by uabelho on Feb 22 2017, 6:32 AM.

Details

Reviewers

majnemer
eli.friedman
efriedma

Commits

rG760dc9aba76a: Remove sometimes faulty rewrite of memcpy in instcombine.
rL296585: Remove sometimes faulty rewrite of memcpy in instcombine.

Summary

Solves PR 31990.

The bad rewrite could replace a memcpy of one byte with
store i4 -1
while it should actually be
store i8 -1
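
To see why the size check in the removed code (`DL.getTypeStoreSize(SrcETy) == Size`) could not catch this, it may help to look at how a store size relates to a bit width. The sketch below is a standalone model, not LLVM code: `storeSizeInBytes`, `passesSizeCheck` and `valueBitsKept` are hypothetical helpers, and the rounding assumes 8-bit bytes.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the store size of an iN type: the number of
// bytes a store of N bits touches, assuming 8-bit bytes (rounds up).
constexpr std::uint64_t storeSizeInBytes(unsigned BitWidth) {
  return (BitWidth + 7) / 8;
}

// Models the guard used by the removed rewrite: the pointee type was
// accepted whenever its store size equalled the memcpy length.
constexpr bool passesSizeCheck(unsigned PointeeBits, std::uint64_t CopyBytes) {
  return storeSizeInBytes(PointeeBits) == CopyBytes;
}

// Value bits that are still defined when Byte is stored through an iN
// type with N <= 8.
constexpr std::uint8_t valueBitsKept(std::uint8_t Byte, unsigned BitWidth) {
  return static_cast<std::uint8_t>(Byte & ((1u << BitWidth) - 1u));
}

static_assert(passesSizeCheck(8, 1), "i8 matches a 1-byte memcpy");
static_assert(passesSizeCheck(4, 1), "i4 matches too, which is the problem");
static_assert(valueBitsKept(0xff, 8) == 0xff, "an i8 store keeps all 8 bits");
static_assert(valueBitsKept(0xff, 4) == 0x0f, "an i4 store keeps only 0xf");
```

Under these assumptions both i4 and i8 pass a one-byte size check, so the check alone cannot tell a type that covers the whole byte from one that only covers part of it.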

Hopefully opt and llc have improved enough that the original optimization
done by the code isn't needed anymore.

One already existing testcase is affected. It originally tested that
the memcpy was replaced with
load double
but since we now remove that rewrite it will be
load i64
instead.

Patch suggestion by Eli Friedman.

Event Timeline

uabelho created this revision. Feb 22 2017, 6:32 AM

What do you think, is it ok to remove the offending code as I've done?

What about the test memcpy-to-load.ll, is the update ok or should I remove it?

efriedma added a subscriber: efriedma. Feb 22 2017, 11:03 AM

efriedma added inline comments.

test/Transforms/InstCombine/memcpy-to-load.ll
15	We already produce this result on 64-bit targets (combineLoadToOperationType transforms double load/store to i64 load/store); the interesting piece is what happens on targets where i64 isn't legal. In general, it shouldn't matter what we produce here; we should always eventually rewrite loads/stores to the appropriate register type for the target. And other optimization passes don't really care about the types of loads and stores anyway (SROA is a lot more flexible than it used to be). So if this does in fact expose a performance regression, we should fix it elsewhere. That said, it would be nice if you could give testsuite performance numbers on some 32-bit target to make sure we aren't missing some important optimization.

uabelho added inline comments. Feb 23 2017, 1:35 AM

test/Transforms/InstCombine/memcpy-to-load.ll
15	Ok, for the performance numbers you'd have to give me some guidance then. Is it these tests you want me to run? http://llvm.org/docs/TestingGuide.html#test-suite-quickstart I have a 64-bit Ubuntu 14.04 Linux machine on which I think I managed to run the test-suite the LNT way, according to: http://llvm.org/docs/lnt/quickstart.html I ran the tests twice without and then twice with the patch, imported all the results into a database and started looking through the web UI, but to be honest I don't know what I'm looking for and I'm quite confused. So please give me some guidance on which numbers you want me to dig up.

Is it these tests you want me to run?
http://llvm.org/docs/TestingGuide.html#test-suite-quickstart

Yes, that's right.

In terms of the results, you should be seeing a page like http://llvm.org/perf/db_default/v4/nts/109028?compare_to=109007 . On the sidebar on the left, there are two dates in bold: those are the two runs you're comparing. The differences between the two runs are listed under the heading "Run-Over-Run Changes Detail". The important number is the first one: the percentage change between the two runs. If it's negative (green), there's a performance improvement, and if it's positive (red), there's a performance regression. (If there are no significant changes between the two runs, this section might be empty.)
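
As a rough illustration of the number described above, the percentage change between two runs could be computed as follows. This is only a sketch: `percentChange` and `classify` are hypothetical names, not part of LNT, and it assumes the metric is a time, so smaller is better.

```cpp
#include <cassert>
#include <string>

// Percentage change of the current run relative to the previous run.
// Negative means the time went down; positive means it went up.
double percentChange(double PreviousSeconds, double CurrentSeconds) {
  return (CurrentSeconds - PreviousSeconds) / PreviousSeconds * 100.0;
}

// Maps the sign of the change to the wording used in the comment above.
std::string classify(double PercentDelta) {
  if (PercentDelta < 0.0)
    return "improvement";
  if (PercentDelta > 0.0)
    return "regression";
  return "unchanged";
}
```

For example, a benchmark that went from 2.0 s to 1.0 s has a change of -50% and would be listed as an improvement.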

For this patch, please make sure you're comparing the performance of 32-bit binaries (build with "-m32" in CFLAGS/CXXFLAGS).

Alright!

So I've done four runs of

lnt runtest nt --sandbox SANDBOX --cflag=-m32 --cc <path-to-compiler> --test-suite <path-to-test-suite>

Two (base1 and base2) on two commits without the patch, where the only difference between the commits is
a whitespace change in CODE_OWNERS.txt.

Then two runs (patch1 and patch2) on two other commits containing the patch, where the only difference
between the commits is the same whitespace change in CODE_OWNERS.txt.

If I compare the two runs without the patch there are Execution Time regressions up to 98.36% and Compile
Time regressions up to 219.18%. A total of: "Performance Regressions 191".

I did the above just to get an indication of how stable the numbers are when nothing significant at all has changed.
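
One way to use such a base-vs-base comparison is as a noise threshold: a patch-vs-base difference only counts if it is larger than the worst difference seen between two runs of the same compiler. The sketch below shows that idea under stated assumptions. The function names are hypothetical, the timings are plain seconds per benchmark, and this is not how LNT itself decides significance.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Relative difference between two timings of the same benchmark, in percent
// of the smaller one.
double relativeDiffPercent(double A, double B) {
  return std::fabs(A - B) / std::min(A, B) * 100.0;
}

// Largest run-to-run difference between two runs of the same compiler,
// taken benchmark by benchmark.
double noiseFloorPercent(const std::vector<double> &Base1,
                         const std::vector<double> &Base2) {
  double Worst = 0.0;
  for (std::size_t I = 0; I < Base1.size() && I < Base2.size(); ++I)
    Worst = std::max(Worst, relativeDiffPercent(Base1[I], Base2[I]));
  return Worst;
}

// A patch-vs-base difference is only interesting if it is above the noise.
bool exceedsNoise(double BaseSeconds, double PatchSeconds,
                  double NoisePercent) {
  return relativeDiffPercent(BaseSeconds, PatchSeconds) > NoisePercent;
}
```

With base-vs-base differences as large as the 98.36% reported above, almost no patch-vs-base difference would clear such a threshold, which is why the measurements need to be stabilised before they can say anything about the patch.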

I have a hard time telling whether the patch changes anything significantly, so I put up a couple of screenshots of
different comparisons from the web UI:
http://imgur.com/a/SVg08

There we have base2 vs base1, which compares the two runs without the patch.

Then four pictures where the two patch runs are compared to the two base runs.

Can you make anything out of this?

(Btw, with --cflag=-m32 three tests seem to fail:

FAIL: MultiSource/Applications/ClamAV/clamscan.compile_time (1 of 2465)
FAIL: MultiSource/Applications/ClamAV/clamscan.execution_time (494 of 2465)
FAIL: MultiSource/Benchmarks/DOE-ProxyApps-C/XSBench/XSBench.execution_time (495 of 2465)

both with and without the patch. I don't see those FAILs without -m32.)

Those results are much more noisy than they should be... if you're seeing run-to-run performance differences of over a second for the same binary, something has gone very wrong. Please ask on llvm-dev for help with that.

Not sure what's causing the test failures... make sure you're passing --cc, --cxx, --cflags, and --cxxflags?

Ok, with help from llvm-dev I now get more stable benchmark numbers.
I don't know if this is still too messy to draw any conclusions from, but
at least it's significantly more stable than before.

I've run the benchmarks with

lnt runtest nt --sandbox SANDBOX --cflag=-m32 --cc <path-to-compiler> --test-suite <path-to-test-suite> --threads 1 --build-threads 12 --benchmarking-only --use-perf=1 --make-param="RUNUNDER=taskset -c 1" --benchmarking-only --multisample=3

lnt runtest nt doesn't have any "--cxxflags" flag so I'm only using
--cflag=-m32.

(The failing tests I got with -m32 seem to be connected with zconf.h,
maybe https://bugs.launchpad.net/ubuntu/+source/zlib/+bug/1155307
I tried the workaround suggested at the end, but then it failed in some
unclear way, so I simply hope that those three tests aren't vital for
the results we're looking for.)

Base vs base comparison (only improvements):
http://i.imgur.com/6eTlc7H.png

Patch vs base comparison (also only improvements):
http://i.imgur.com/ye8EDSl.png

Ok?

Numbers are good enough; LGTM.

This revision is now accepted and ready to land. Feb 28 2017, 12:56 PM

Closed by commit rL296585: Remove sometimes faulty rewrite of memcpy in instcombine. (authored by uabelho). Feb 28 2017, 10:57 PM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path                                                   Size
lib/Transforms/InstCombine/InstCombineCalls.cpp        67 lines
test/Transforms/InstCombine/memcpy-to-load.ll          6 lines
test/Transforms/InstCombine/pr31990_wrong_memcpy.ll    26 lines

Diff 89354

lib/Transforms/InstCombine/InstCombineCalls.cpp

[... 70 unchanged lines not shown ...]
 static Type *getPromotedType(Type *Ty) {
   if (IntegerType* ITy = dyn_cast<IntegerType>(Ty)) {
     if (ITy->getBitWidth() < 32)
       return Type::getInt32Ty(Ty->getContext());
   }
   return Ty;
 }
 
-/// Given an aggregate type which ultimately holds a single scalar element,
-/// like {{{type}}} or [1 x type], return type.
-static Type *reduceToSingleValueType(Type *T) {
-  while (!T->isSingleValueType()) {
-    if (StructType *STy = dyn_cast<StructType>(T)) {
-      if (STy->getNumElements() == 1)
-        T = STy->getElementType(0);
-      else
-        break;
-    } else if (ArrayType *ATy = dyn_cast<ArrayType>(T)) {
-      if (ATy->getNumElements() == 1)
-        T = ATy->getElementType();
-      else
-        break;
-    } else
-      break;
-  }
-
-  return T;
-}
-
 /// Return a constant boolean vector that has true elements in all positions
 /// where the input constant data vector has an element with the sign bit set.
 static Constant *getNegativeIsTrueBoolVec(ConstantDataVector *V) {
   SmallVector<Constant *, 32> BoolVec;
   IntegerType *BoolTy = Type::getInt1Ty(V->getContext());
   for (unsigned I = 0, E = V->getNumElements(); I != E; ++I) {
     Constant *Elt = V->getElementAsConstant(I);
     assert((isa<ConstantInt>(Elt) || isa<ConstantFP>(Elt)) &&
[... 109 unchanged lines not shown ...]
   unsigned SrcAddrSp =
     cast<PointerType>(MI->getArgOperand(1)->getType())->getAddressSpace();
   unsigned DstAddrSp =
     cast<PointerType>(MI->getArgOperand(0)->getType())->getAddressSpace();
 
   IntegerType* IntType = IntegerType::get(MI->getContext(), Size<<3);
   Type *NewSrcPtrTy = PointerType::get(IntType, SrcAddrSp);
   Type *NewDstPtrTy = PointerType::get(IntType, DstAddrSp);
 
-  // Memcpy forces the use of i8* for the source and destination. That means
-  // that if you're using memcpy to move one double around, you'll get a cast
-  // from double* to i8*. We'd much rather use a double load+store rather than
-  // an i64 load+store, here because this improves the odds that the source or
-  // dest address will be promotable. See if we can find a better type than the
-  // integer datatype.
-  Value *StrippedDest = MI->getArgOperand(0)->stripPointerCasts();
+  // If the memcpy has metadata describing the members, see if we can get the
+  // TBAA tag describing our copy.
   MDNode *CopyMD = nullptr;
-  if (StrippedDest != MI->getArgOperand(0)) {
-    Type *SrcETy = cast<PointerType>(StrippedDest->getType())
-                                    ->getElementType();
-    if (SrcETy->isSized() && DL.getTypeStoreSize(SrcETy) == Size) {
-      // The SrcETy might be something like {{{double}}} or [1 x double]. Rip
-      // down through these levels if so.
-      SrcETy = reduceToSingleValueType(SrcETy);
-
-      if (SrcETy->isSingleValueType()) {
-        NewSrcPtrTy = PointerType::get(SrcETy, SrcAddrSp);
-        NewDstPtrTy = PointerType::get(SrcETy, DstAddrSp);
-
-        // If the memcpy has metadata describing the members, see if we can
-        // get the TBAA tag describing our copy.
-        if (MDNode *M = MI->getMetadata(LLVMContext::MD_tbaa_struct)) {
-          if (M->getNumOperands() == 3 && M->getOperand(0) &&
-              mdconst::hasa<ConstantInt>(M->getOperand(0)) &&
-              mdconst::extract<ConstantInt>(M->getOperand(0))->isNullValue() &&
-              M->getOperand(1) &&
-              mdconst::hasa<ConstantInt>(M->getOperand(1)) &&
-              mdconst::extract<ConstantInt>(M->getOperand(1))->getValue() ==
-              Size &&
-              M->getOperand(2) && isa<MDNode>(M->getOperand(2)))
-            CopyMD = cast<MDNode>(M->getOperand(2));
-        }
-      }
-    }
-  }
+  if (MDNode *M = MI->getMetadata(LLVMContext::MD_tbaa_struct)) {
+    if (M->getNumOperands() == 3 && M->getOperand(0) &&
+        mdconst::hasa<ConstantInt>(M->getOperand(0)) &&
+        mdconst::extract<ConstantInt>(M->getOperand(0))->isNullValue() &&
+        M->getOperand(1) &&
+        mdconst::hasa<ConstantInt>(M->getOperand(1)) &&
+        mdconst::extract<ConstantInt>(M->getOperand(1))->getValue() ==
+        Size &&
+        M->getOperand(2) && isa<MDNode>(M->getOperand(2)))
+      CopyMD = cast<MDNode>(M->getOperand(2));
+  }
 
   // If the memcpy/memmove provides better alignment info than we can
   // infer, use it.
   SrcAlign = std::max(SrcAlign, CopyAlign);
   DstAlign = std::max(DstAlign, CopyAlign);
 
   Value *Src = Builder->CreateBitCast(MI->getArgOperand(1), NewSrcPtrTy);
   Value *Dest = Builder->CreateBitCast(MI->getArgOperand(0), NewDstPtrTy);
[... 3,822 unchanged lines not shown ...]

test/Transforms/InstCombine/memcpy-to-load.ll

-; RUN: opt < %s -instcombine -S | grep "load double"
+; RUN: opt < %s -instcombine -S | FileCheck %s
 target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:32:64-f32:32:32-f64:32:64-v64:64:64-v128:128:128-a0:0:64-f80:128:128"
 target triple = "i686-apple-darwin8"
 
 define void @foo(double* %X, double* %Y) {
 entry:
   %tmp2 = bitcast double* %X to i8*
   %tmp13 = bitcast double* %Y to i8*
   call void @llvm.memcpy.p0i8.p0i8.i32(i8* %tmp2, i8* %tmp13, i32 8, i32 1, i1 false)
   ret void
 }
 
+; Make sure that the memcpy has been replace with a load/store of i64
+; CHECK: [[TMP:%[0-9]+]] = load i64
+; CHECK: store i64 [[TMP]]
+
 declare void @llvm.memcpy.p0i8.p0i8.i32(i8* nocapture, i8* nocapture, i32, i32, i1) nounwind

test/Transforms/InstCombine/pr31990_wrong_memcpy.ll

This file was added.

+; RUN: opt -S -instcombine %s -o - | FileCheck %s
+
+; Regression test of PR31990. A memcpy of one byte, copying 0xff, was
+; replaced with a single store of an i4 0xf.
+
+@g = constant i8 -1
+
+define void @foo() {
+entry:
+  %0 = alloca i8
+  %1 = bitcast i8* %0 to i4*
+  call void @bar(i4* %1)
+  %2 = bitcast i4* %1 to i8*
+  call void @llvm.memcpy.p0i8.p0i8.i32(i8* %2, i8* @g, i32 1, i32 1, i1 false)
+  call void @gaz(i8* %2)
+  ret void
+}
+
+declare void @llvm.memcpy.p0i8.p0i8.i32(i8* nocapture writeonly,
+                                        i8* nocapture readonly, i32, i32, i1)
+declare void @bar(i4*)
+declare void @gaz(i8*)
+
+; The mempcy should be simplified to a single store of an i8, not i4
+; CHECK: store i8 -1
+; CHECK-NOT: store i4 -1