Hi all,
This is an optimized memcpy routine for AArch64 that applies lessons learned from Arm's Optimized Routines (AOR) memcpy (the AArch64 Advanced SIMD variant, to be precise).
I benchmarked this on a Neoverse-N1, and it beats the current default implementation for both small and big configurations.
I am not entirely familiar with how the compile-time libc implementation selection works, so as you can see from this patch I've enabled this for 'AArch64'. The community may want to choose different implementations for other AArch64 cores, though. I believe the non-Advanced SIMD variant of AOR's memcpy has been found to work better on earlier AArch64 cores, for instance.
I'm curious to hear how the maintainers/community feel about this.
Kind Regards,
Andre
This change is not needed: it will be handled by the else() clause.
We have a special case for x86 so that the same code can support both 32-bit and 64-bit architectures.