This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] do not expand 8 byte memcpy if optimising for minsize
Abandoned · Public

Authored by SjoerdMeijer on Sep 14 2018, 2:49 AM.

Details

Summary

Do not expand an 8-byte copy to loads/stores when we optimise for minimum
code size. This could for example expand into 2 word-sized loads and
2 stores, but when unaligned data access is not supported it is a lot
worse and we end up with 8 single-byte loads and 8 single-byte stores.
Keeping the memcpy call will result in just 2 instructions: the call and
a mov of an immediate to an argument register for the number of bytes to copy.
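
For illustration only, a minimal standalone C++ sketch of the guard this summary describes; the function and parameter names are made up here and this is not the actual InstCombine code:

#include <cstdint>

// Hypothetical model of the proposed check: under minsize, keep the memcpy
// call instead of expanding copies larger than one 4-byte word.
bool shouldKeepMemcpyCall(uint64_t CopySizeInBytes, bool OptForMinSize) {
  const uint64_t WordSizeInBytes = 4; // the hardcoded "Size > 4" under discussion
  return OptForMinSize && CopySizeInBytes > WordSizeInBytes;
}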

Diff Detail

Event Timeline

SjoerdMeijer created this revision. Sep 14 2018, 2:49 AM
lebedev.ri added a subscriber: lebedev.ri. Edited Sep 14 2018, 2:58 AM

Two questions:

  1. Should this be somehow more parametrized, rather than hardcoding magical Size > 4? Some target info?
  2. Is there some reverse transform, that tries to actually form memcpy from such expanded load+store pairs? Should it be disabled too, to avoid looping?

fixed the test case.

It seems a bit odd to not inline an 8-byte memcpy on a 64-bit system, even at minsize. Perhaps this should be based on whether i64 is a legal type?

Thanks both, those are fair points.

For a bit more context, this is the problem that I am trying to solve:

void foo (char *A, char *B) {
  memcpy(A, B, 8);
}

Compiled with -Oz -mno-unaligned-access, this results in this disaster:

ldrb	r3, [r1]
ldrb	r4, [r1, #1]
ldrb	r5, [r1, #2]
ldrb	r6, [r1, #3]
ldrb	r1, [r1, #5]
ldrb	lr, [r2, #4]!
ldrb.w	r12, [r2, #2]
ldrb	r2, [r2, #3]
strb	r1, [r0, #5]
strb	r6, [r0, #3]
strb	r5, [r0, #2]
strb	r4, [r0, #1]
strb	r3, [r0]
strb	lr, [r0, #4]!
strb	r2, [r0, #3]
strb.w	r12, [r0, #2]
ldr	r11, [sp], #4

But even forgetting about the no-unaligned case, we can see that with unaligned access support the code bloat is already there:

ldr	r2, [r1]
ldr	r1, [r1, #4]
str	r1, [r0, #4]
str	r2, [r0]
bx	lr

So, for the decision-making here in InstCombine, which is mostly target-independent at the moment, I would like to ignore the whole aligned/unaligned business. What I want to generate is of course just this:

movs	r2, #8
b	__aeabi_memcpy

Now, surprisingly, this is also what we generate for X86 and AArch64 with -Oz, whereas we would perhaps expect a load and a store on these 64-bit architectures. I don't know why that is not happening, or whether there is a reason for it; I need to look into that.

Either way, this patch generates the same code and is consistent with that. And I think the hardcoding of Size > 4 is mostly in line with some of the checks already there.

Are you sure you shouldn't be fixing this in the backend?

I agree with all of the earlier review comments:

  1. The way the code is written currently is not good.
  2. If we're going to expand at all, we should be using the datalayout to decide when that is appropriate (no magic numbers).
  3. If we want to distinguish optimizing for size from optimizing for speed, that belongs in the backend - and it's already there. For example, see SelectionDAG::getMemcpy() and MemCpyOptimizer.cpp.

There was a proposal to use datalayout here:
D35035
...but that patch had other problems, and it looks stalled.

lebedev.ri added inline comments. Sep 15 2018, 7:18 AM
lib/Transforms/InstCombine/InstCombineCalls.cpp:136

FWIW even this 8 shouldn't be here.
This should be two checks - power of two, and datalayout.
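
A rough standalone sketch (in C++, with made-up helper names and a plain list standing in for what the datalayout would report, not the real DataLayout API) of what those two checks could look like:

#include <cstdint>
#include <initializer_list>

// Check 1: the copy size must be a power of two.
bool isPowerOfTwo(uint64_t N) { return N != 0 && (N & (N - 1)) == 0; }

// Check 2: the matching integer width must be legal for the target.
// LegalIntWidthsInBits is only a stand-in for datalayout information.
bool mayExpandMemTransfer(uint64_t SizeInBytes,
                          std::initializer_list<unsigned> LegalIntWidthsInBits) {
  if (!isPowerOfTwo(SizeInBytes))
    return false;
  for (unsigned Width : LegalIntWidthsInBits)
    if (Width == SizeInBytes * 8)
      return true;
  return false;
}

// Example: an 8-byte copy may be expanded when i64 is legal, but not otherwise:
//   mayExpandMemTransfer(8, {8, 16, 32, 64})  -> true
//   mayExpandMemTransfer(8, {8, 16, 32})      -> false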

Thanks for the feedback and suggestions! Summarising where we are:

  • I think this WIP patch generates the code that we want, for both 32-bit and 64-bit targets. In the 64-bit backends, the 8-byte memcpy is expanded to just a load and a store (see also the comments below about the backend dealing with the memcpy). So what I said earlier, that we also generate the libcall for X86 and AArch64 with -Oz, wasn't true; that was due to a problem in my test.
  • But the main problem now is that we are not happy with the current implementation.

Just checking, and I think we also agree on this, that we are saying the same thing here:

If we want to distinguish optimizing for size from optimizing for speed, that belongs in the backend - and it's already there. For example, see SelectionDAG::getMemcpy() and MemCpyOptimizer.cpp.

Exactly, so we rewrite the library call too early here in InstCombine, and the backend should deal with it, which is what this patch was trying to achieve.

Looks like my approach is going to be this then:

  • first an NFC patch to clean up this existing bit of code using the datalayout,
  • then I will try to expand on this.

I think this WIP patch generates the code that we want

"we"? Do keep in mind that there is more than one backend, more than one target architecture.

Sure, I've checked x86, ARM and AArch64.

One more thought then on this:

If we want to distinguish optimizing for size from optimizing for speed, that belongs in the backend

That means we shouldn't be doing any memcpy lowering at all in InstCombiner::SimplifyAnyMemTransfer. In other words, if I rewrite this patch to actually strip out any memcpy lowering here in InstCombine, would that be an acceptable approach?

We don't do nearly as many optimizations in the backend, so just not expanding memcpy in the middle end will certainly negatively affect the overall optimization.
I think the best path forward here is simply to derive the maximal size of memcpy that is OK to expand from the datalayout.
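
As a sketch of that idea (again with a made-up name and a plain parameter standing in for the datalayout, not the real LLVM API), the maximal expandable size would simply be the largest legal integer width in bytes:

#include <cstdint>

// The largest copy worth expanding to a single load/store pair is the
// target's largest legal integer, expressed in bytes.
uint64_t maxExpandableMemcpyBytes(unsigned LargestLegalIntWidthInBits) {
  return LargestLegalIntWidthInBits / 8;
}

// e.g. a 32-bit ARM-style datalayout (largest legal integer i32) gives 4,
// while an AArch64 or X86-64 style one (largest legal integer i64) gives 8.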

There was a proposal to use datalayout here:
D35035
...but that patch had other problems, and it looks stalled.

^ and I think that patch will do such a change.

Cool, I am going to pick that one up then. Cheers.

SjoerdMeijer abandoned this revision. Sep 19 2018, 6:07 AM

This is done better in D35035 (and is no longer stalled).