Do not expand an 8-byte copy into loads/stores when we optimise for minimum
code size. Such a copy could, for example, expand into 2 word loads and
2 word stores. But when unaligned data access is not supported, it gets a
lot worse and we end up with eight single-byte loads and eight single-byte
stores. Keeping the memcpy call results in just 2 instructions: the call and a
mov imm into an arg register for the number of bytes to copy.
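A minimal sketch of the intended guard, using a standalone helper with made-up names (the actual check sits inside InstCombine's memcpy handling, not in a helper like this):

#include <cstdint>

// Illustrative sketch only: helper and parameter names are hypothetical.
static bool shouldExpandMemTransfer(uint64_t Size, bool MinSize) {
  // Under minimum-size optimisation (-Oz), keep the libcall for anything
  // larger than a word: the call plus a 'mov imm' is smaller than the
  // expanded load/store sequence, particularly when unaligned accesses
  // are not available.
  if (MinSize && Size > 4)
    return false;
  return true;
}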
Two questions:
- Should this be somehow more parametrized, rather than hardcoding magical Size > 4? Some target info?
- Is there some reverse transform, that tries to actually form memcpy from such expanded load+store pairs? Should it be disabled too, to avoid looping?
It seems a bit odd to not inline an 8-byte memcpy on a 64-bit system, even at minsize. Perhaps this should be based on whether i64 is a legal type?
Thanks both, those are fair points.
For a bit more context, this is the problem that I am trying to solve:
void foo (char *A, char *B) { memcpy(A, B, 8); }
Compiled with -Oz -mno-unaligned-access, this results in this disaster:
ldrb r3, [r1]
ldrb r4, [r1, #1]
ldrb r5, [r1, #2]
ldrb r6, [r1, #3]
ldrb r1, [r1, #5]
ldrb lr, [r2, #4]!
ldrb.w r12, [r2, #2]
ldrb r2, [r2, #3]
strb r1, [r0, #5]
strb r6, [r0, #3]
strb r5, [r0, #2]
strb r4, [r0, #1]
strb r3, [r0]
strb lr, [r0, #4]!
strb r2, [r0, #3]
strb.w r12, [r0, #2]
ldr r11, [sp], #4
But even setting aside the no-unaligned case, we see that the code bloat is already there when unaligned access is supported:
ldr r2, [r1]
ldr r1, [r1, #4]
str r1, [r0, #4]
str r2, [r0]
bx lr
So, for the decision making here in InstCombine, which is mostly target independent at the moment, I would like to ignore the whole aligned/unaligned business. And what I want to generate is of course just this:
movs r2, #8
b __aeabi_memcpy
Now, surprisingly, this is also what we generate for X86 and AArch64 with -Oz, whereas we would perhaps expect a load and a store on these 64-bit architectures? I don't know why that is not happening, or whether there is a reason for it, and I need to look into that.
Either way, this patch generates the same code, and is consistent with that. And I think the hard coding of size > 4 is mostly in line with some of the checks already there.
Are you sure you shouldn't be fixing this in the backend?
I agree with all of the earlier review comments:
- The way the code is written currently is not good.
- If we're going to expand at all, we should be using the datalayout to decide when that is appropriate (no magic numbers).
- If we want to distinguish optimizing for size from optimizing for speed, that belongs in the backend - and it's already there. For example, see SelectionDAG::getMemcpy() and MemCpyOptimizer.cpp.
There was a proposal to use datalayout here:
D35035
...but that patch had other problems, and it looks stalled.
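For context on the last point, a rough illustration of how the size/speed split already works on the backend path: the target advertises how many stores it is willing to emit for an inlined memcpy, with a separate, tighter limit when optimising for size. The snippet below only mirrors that idea with made-up types; the real logic lives in SelectionDAG::getMemcpy() and the TargetLowering hooks.

// Illustrative only: not the real TargetLowering interface, just the shape
// of the limits that the backend consults when expanding a memcpy.
struct MemcpyLimits {
  unsigned MaxStoresPerMemcpy = 8;        // limit when optimising for speed
  unsigned MaxStoresPerMemcpyOptSize = 4; // tighter limit under -Os/-Oz
};

// An inline expansion is only attempted if it fits under the target's limit
// for the current size/speed trade-off; otherwise the libcall is kept.
static bool shouldInlineMemcpy(const MemcpyLimits &Limits, unsigned NumStores,
                               bool OptSize) {
  unsigned Limit = OptSize ? Limits.MaxStoresPerMemcpyOptSize
                           : Limits.MaxStoresPerMemcpy;
  return NumStores <= Limit;
}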
lib/Transforms/InstCombine/InstCombineCalls.cpp:136
FWIW even this 8 shouldn't be here.
Thanks for the feedback and suggestions! Summarising where we are:
- I think this WIP patch generates the code that we want, for 32-bit and 64-bit. In the 64-bit backends, the 8-byte memcpy is expanded to just a load and a store (see also the comments below about the backend dealing with the memcpy). So what I said earlier, that we also generate the libcall for X86 and AArch64 with -Oz, wasn't true; that was due to a problem in my test.
- But the main problem now is that we are not happy with the current implementation.
Just checking, and I think we also agree on this, we are saying the same things here:
If we want to distinguish optimizing for size from optimizing for speed, that belongs in the backend - and it's already there. For example, see SelectionDAG::getMemcpy() and MemCpyOptimizer.cpp.
Exactly, so we rewrite the library call too early here in InstCombine, and the backend should deal with it, which is what this patch was trying to achieve.
Looks like my approach is going to be this then:
- first an NFC patch to clean up this existing bit of code using datalayout,
- then I will try to expand on this.
we ?
Do keep in mind that there is more than one backend, more than one target architecture.
Sure, I've checked x86, ARM and AArch64.
One more thought then on this:
If we want to distinguish optimizing for size from optimizing for speed, that belongs in the backend
That means we shouldn't be doing any memcpy lowering at all in InstCombiner::SimplifyAnyMemTransfer. In other words, if I rewrite this patch to actually strip out all memcpy lowering here in InstCombine, would that be an acceptable approach?
We don't do nearly as many optimizations in the backend, so simply not expanding memcpy in the middle end will certainly hurt the overall optimization result.
I think the best path forward here is to only derive, from the datalayout, the maximal memcpy size that is OK to expand
^ and I think that patch will make such a change.
FWIW even this 8 shouldn't be here.
This should be two checks - power of two, and datalayout.
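If it helps, here is one way those two checks could look, assuming the caller has the module's DataLayout at hand; this is only a sketch of the shape, not the eventual patch:

#include "llvm/IR/DataLayout.h"
#include "llvm/Support/MathExtras.h"
#include <cstdint>

using namespace llvm;

// Sketch of the two suggested checks: the copy size must be a power of two
// and must not exceed the largest legal integer width the DataLayout declares.
static bool isCheapToExpand(uint64_t Size, const DataLayout &DL) {
  if (!isPowerOf2_64(Size))
    return false;
  // getLargestLegalIntTypeSizeInBits() returns 0 if the target declares no
  // legal integer widths in its datalayout string.
  uint64_t MaxBytes = DL.getLargestLegalIntTypeSizeInBits() / 8;
  return MaxBytes != 0 && Size <= MaxBytes;
}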