This is an optimized memcpy routine for AArch64 using lessons learned from Arm's Optimized Routines(AOR) memcpy (the aarch64 Advanced SIMD one to be precise).
Benchmarked this on a Neoverse-N1 and it beats the current default implementation for both small and big configurations.
I am not entirely familiar with how the compile-time libc implementation selection works, so as you can see from this patch I've enabled this for 'AArch64', though maybe the community may want to choose different implementations for other AArch64 cores? I believe the non-Advanced SIMD variant of AOR's memcpy has been found to work better for earlier AArch64 cores for instance.
I'm curious to hear how the maintainers/community feel about this.