Here is the result of my attempts to inline memcpy() in an "optimal" way, meaning interleaved load/store pair instructions that use 64-bit registers.
It was suggested that this be done in the AArch64LoadStoreOptimizer pass, which worked until the PostRA Machine Instruction Scheduler was enabled for the AArch64 target; hence it became a separate pass that runs after PostRA MISched. The pass is disabled by default, but the tests have been changed so that they pass both with and without it.
When an ldr/str is in the middle, it is reordered as well, except for cases like:
ldr ldp stp str
which occur only when copying a small amount of data, and I'm not sure if it's worth reordering them to
ldr str ldp stp
but that can be done.
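To make the intended ordering concrete, here is a minimal sketch of the reordering on a toy model. It is not the code from the pass: `MemOp`, `interleave`, and `mnemonics` are hypothetical names, and the model assumes every store has exactly one matching load (identified by a made-up `Chunk` id) and ignores register dependencies, aliasing, and address order.

```cpp
#include <string>
#include <vector>

// Hypothetical model of one memory instruction in an inlined memcpy.
struct MemOp {
  std::string Mnemonic; // "ldr", "ldp", "str" or "stp"
  bool IsLoad;          // true for ldr/ldp, false for str/stp
  int Chunk;            // made-up id of the data chunk this instruction moves
};

// Emit each load, in its original order, immediately followed by the
// store that writes the same chunk.
std::vector<MemOp> interleave(const std::vector<MemOp> &Ops) {
  std::vector<MemOp> Result;
  for (const MemOp &Load : Ops) {
    if (!Load.IsLoad)
      continue;
    Result.push_back(Load);
    for (const MemOp &Store : Ops)
      if (!Store.IsLoad && Store.Chunk == Load.Chunk)
        Result.push_back(Store);
  }
  return Result;
}

// Join the mnemonics with spaces so a sequence is easy to compare.
std::string mnemonics(const std::vector<MemOp> &Ops) {
  std::string Out;
  for (const MemOp &Op : Ops) {
    if (!Out.empty())
      Out += ' ';
    Out += Op.Mnemonic;
  }
  return Out;
}
```

Feeding it the sequence ldr ldp stp str, with the ldr/str sharing one chunk and the ldp/stp sharing another, gives ldr str ldp stp.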
Unfortunately, I don't have AArch64 hardware to run performance tests yet, so I can't back this up with numbers, but such a sequence was claimed to be preferred. At least this gives a way to test it. Or it can just stay here for now.
The important thing is that we have ldp/stp in that order, ideally with increasing addresses. We don't need to cluster them all together - it's the ordering of the memory operations that counts, I think.
So we can have:
ldp
stp
add # unrelated operation
ldp
stp
This should be fine, and may be a good thing, depending on the microarchitecture.
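The property described here could be checked roughly as follows. This is only a sketch over plain mnemonic strings: `isInterleaved` is a hypothetical helper, it treats only ldr/ldp as loads and str/stp as stores, skips everything else, and does not look at addresses at all.

```cpp
#include <string>
#include <vector>

// Return true if the memory operations strictly alternate load, store,
// load, store, ... with any unrelated instructions allowed in between.
bool isInterleaved(const std::vector<std::string> &Mnemonics) {
  bool ExpectLoad = true;
  for (const std::string &M : Mnemonics) {
    bool IsLoad = (M == "ldr" || M == "ldp");
    bool IsStore = (M == "str" || M == "stp");
    if (!IsLoad && !IsStore)
      continue; // unrelated operation, e.g. add
    if (IsLoad != ExpectLoad)
      return false; // two loads or two stores in a row
    ExpectLoad = !ExpectLoad;
  }
  // A trailing load without its store does not count as interleaved.
  return ExpectLoad;
}
```

With this check, ldp stp add ldp stp is accepted, while ldr ldp stp str is rejected.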