This is the result of my attempts to inline memcpy() in an "optimal" way, meaning interleaved load/store pair instructions that use 64-bit registers.
It was suggested that this be done in the AArch64LoadStoreOptimizer pass, which worked until the PostRA Machine Instruction Scheduler was enabled for the AArch64 target; hence it became a separate pass that runs after PostRA MISched. The pass is disabled by default, but the test changes make the tests pass both with and without it.
When an ldr/str is in the middle of the sequence, it is reordered as well, except for cases like:
ldr ldp stp str
which occur only when copying a small amount of data, and I'm not sure if it's worth reordering them to
ldr str ldp stp
but that can be done.
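
To make the intended ordering concrete, here is a minimal toy sketch of the interleaving idea. It is not the code from the pass: instructions are modelled as bare mnemonics, the function name interleaveCopy is hypothetical, and each load is matched with the first later store of the same width (single vs. pair). A real pass would have to match on the registers actually used and check for aliasing, which this sketch ignores.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy model (assumption): an instruction is only its mnemonic.
static bool isLoad(const std::string &Op) { return Op == "ldr" || Op == "ldp"; }
static bool isPair(const std::string &Op) { return Op == "ldp" || Op == "stp"; }

// Hypothetical helper: emit every load immediately followed by the first
// not-yet-emitted later store of the same width. Stores that no load claims
// are emitted at their original position.
std::vector<std::string> interleaveCopy(const std::vector<std::string> &Seq) {
  std::vector<std::string> Out;
  std::vector<bool> Used(Seq.size(), false);
  for (std::size_t I = 0; I < Seq.size(); ++I) {
    if (Used[I])
      continue; // already emitted as the partner of an earlier load
    Used[I] = true;
    Out.push_back(Seq[I]);
    if (!isLoad(Seq[I]))
      continue;
    // Look for a matching store after this load.
    for (std::size_t J = I + 1; J < Seq.size(); ++J) {
      if (!Used[J] && !isLoad(Seq[J]) && isPair(Seq[J]) == isPair(Seq[I])) {
        Used[J] = true;
        Out.push_back(Seq[J]);
        break;
      }
    }
  }
  return Out;
}
```

Under this model, the sequence ldp ldp stp stp becomes ldp stp ldp stp, and the small-copy case ldr ldp stp str becomes ldr str ldp stp.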
Unfortunately, I don't have AArch64 hardware to run performance tests yet, so I can't back this up with numbers, but such a sequence was claimed to be preferred. At least this gives a way to test it. Or it can just stay here for now.
s/af/as