This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] do not expand 8 byte memcpy if optimising for minsize
Abandoned · Public

Authored by SjoerdMeijer on Sep 14 2018, 2:49 AM.

Details

Summary

Do not expand an 8-byte copy to loads/stores when we optimise for minimum
code size. This could for example expand into 2 word-sized loads and
2 stores, but when unaligned data access is not supported it is a lot
worse and we end up with 8 single-byte loads and 8 single-byte stores.
Keeping the memcpy call will result in just 2 instructions: the call and
a mov of an immediate to an argument register for the number of bytes to copy.
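
For illustration only, a minimal standalone C++ sketch of the guard this summary describes; the function and parameter names are made up here and this is not the actual InstCombine code:

#include <cstdint>

// Hypothetical model of the proposed check: under minsize, keep the memcpy
// call instead of expanding copies larger than one 4-byte word.
bool shouldKeepMemcpyCall(uint64_t CopySizeInBytes, bool OptForMinSize) {
  const uint64_t WordSizeInBytes = 4; // the hardcoded "Size > 4" under discussion
  return OptForMinSize && CopySizeInBytes > WordSizeInBytes;
}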

Diff Detail

Event Timeline

SjoerdMeijer created this revision. Sep 14 2018, 2:49 AM
lebedev.ri added a subscriber: lebedev.ri. Edited Sep 14 2018, 2:58 AM

Two questions:

  1. Should this be somehow more parametrized, rather than hardcoding magical Size > 4? Some target info?
  2. Is there some reverse transform, that tries to actually form memcpy from such expanded load+store pairs? Should it be disabled too, to avoid looping?

fixed the test case.

It seems a bit odd to not inline an 8-byte memcpy on a 64-bit system, even at minsize. Perhaps this should be based on whether i64 is a legal type?

Thanks both, those are fair points.

For a bit more context, this is the problem that I am trying to solve:

void foo (char *A, char *B) {
  memcpy(A, B, 8);
}

Compiled with -Oz -mno-unaligned-access, this results in this disaster:

ldrb	r3, [r1]
ldrb	r4, [r1, #1]
ldrb	r5, [r1, #2]
ldrb	r6, [r1, #3]
ldrb	r1, [r1, #5]
ldrb	lr, [r2, #4]!
ldrb.w	r12, [r2, #2]
ldrb	r2, [r2, #3]
strb	r1, [r0, #5]
strb	r6, [r0, #3]
strb	r5, [r0, #2]
strb	r4, [r0, #1]
strb	r3, [r0]
strb	lr, [r0, #4]!
strb	r2, [r0, #3]
strb.w	r12, [r0, #2]
ldr	r11, [sp], #4

But even forgetting about the no-unaligned case, we can see that with unaligned access support the code bloat is already there:

ldr	r2, [r1]
ldr	r1, [r1, #4]
str	r1, [r0, #4]
str	r2, [r0]
bx	lr

So, for the decision-making here in InstCombine, which is mostly target-independent at the moment, I would like to ignore the whole aligned/unaligned business. What I want to generate is of course just this:

movs	r2, #8
b	__aeabi_memcpy

Now, surprisingly, this is also what we generate for X86 and AArch64 with -Oz, whereas we would perhaps expect a load and a store on these 64-bit architectures. I don't know why that is not happening, or whether there is a reason for it; I need to look into that.

Either way, this patch generates the same code and is consistent with that. And I think the hardcoding of Size > 4 is mostly in line with some of the checks already there.

Are you sure you shouldn't be fixing this in the backend?

I agree with all of the earlier review comments:

  1. The way the code is written currently is not good.
  2. If we're going to expand at all, we should be using the datalayout to decide when that is appropriate (no magic numbers).
  3. If we want to distinguish optimizing for size from optimizing for speed, that belongs in the backend - and it's already there. For example, see SelectionDAG::getMemcpy() and MemCpyOptimizer.cpp.

There was a proposal to use datalayout here:
D35035
...but that patch had other problems, and it looks stalled.

lebedev.ri added inline comments. Sep 15 2018, 7:18 AM
lib/Transforms/InstCombine/InstCombineCalls.cpp:136

FWIW even this 8 shouldn't be here.
This should be two checks - power of two, and datalayout.
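
A rough standalone sketch (in C++, with made-up helper names and a plain list standing in for what the datalayout would report, not the real DataLayout API) of what those two checks could look like:

#include <cstdint>
#include <initializer_list>

// Check 1: the copy size must be a power of two.
bool isPowerOfTwo(uint64_t N) { return N != 0 && (N & (N - 1)) == 0; }

// Check 2: the matching integer width must be legal for the target.
// LegalIntWidthsInBits is only a stand-in for datalayout information.
bool mayExpandMemTransfer(uint64_t SizeInBytes,
                          std::initializer_list<unsigned> LegalIntWidthsInBits) {
  if (!isPowerOfTwo(SizeInBytes))
    return false;
  for (unsigned Width : LegalIntWidthsInBits)
    if (Width == SizeInBytes * 8)
      return true;
  return false;
}

// Example: an 8-byte copy may be expanded when i64 is legal, but not otherwise:
//   mayExpandMemTransfer(8, {8, 16, 32, 64})  -> true
//   mayExpandMemTransfer(8, {8, 16, 32})      -> false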

Thanks for the feedback and suggestions! Summarising where we are:

  • I think this WIP patch generates the code that we want, for both 32-bit and 64-bit targets. In the 64-bit backends, the 8-byte memcpy is expanded to just a load and a store (see also the comments below about the backend dealing with the memcpy). So what I said earlier, that we also generate the libcall for X86 and AArch64 with -Oz, wasn't true; that was due to a problem in my test.
  • But the main problem now is that we are not happy with the current implementation.

Just checking, and I think we also agree on this, that we are saying the same thing here:

If we want to distinguish optimizing for size from optimizing for speed, that belongs in the backend - and it's already there. For example, see SelectionDAG::getMemcpy() and MemCpyOptimizer.cpp.

Exactly, so we rewrite the library call too early here in InstCombine, and the backend should deal with it, which is what this patch was trying to achieve.

Looks like my approach is going to be this then:

  • first an NFC patch to clean up this existing bit of code using the datalayout,
  • then I will try to expand on this.

I think this WIP patch generates the code that we want

"we"? Do keep in mind that there is more than one backend, more than one target architecture.

Sure, I've checked x86, ARM and AArch64.

One more thought then on this:

If we want to distinguish optimizing for size from optimizing for speed, that belongs in the backend

That means we shouldn't be doing any memcpy lowering at all in InstCombiner::SimplifyAnyMemTransfer. In other words, if I rewrite this patch to actually strip out any memcpy lowering here in InstCombine, would that be an acceptable approach?

We don't do nearly as many optimizations in the backend, so just not expanding memcpy in the middle end will certainly negatively affect the overall optimization.
I think the best path forward here is simply to derive the maximal size of memcpy that is OK to expand from the datalayout.
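
As a sketch of that idea (again with a made-up name and a plain parameter standing in for the datalayout, not the real LLVM API), the maximal expandable size would simply be the largest legal integer width in bytes:

#include <cstdint>

// The largest copy worth expanding to a single load/store pair is the
// target's largest legal integer, expressed in bytes.
uint64_t maxExpandableMemcpyBytes(unsigned LargestLegalIntWidthInBits) {
  return LargestLegalIntWidthInBits / 8;
}

// e.g. a 32-bit ARM-style datalayout (largest legal integer i32) gives 4,
// while an AArch64 or X86-64 style one (largest legal integer i64) gives 8.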

There was a proposal to use datalayout here:
D35035
...but that patch had other problems, and it looks stalled.

^ and I think that patch will do such a change.

Cool, I am going to pick that one up then. Cheers.

SjoerdMeijer abandoned this revision. Sep 19 2018, 6:07 AM

This is done better in D35035 (and is no longer stalled).