This is a follow-up to D101524 which:
- simplifies CPU feature detection and usage,
- flattens target-dependent optimizations so it's obvious which implementations are generated,
- provides an implementation targeting the host (march/mtune=native) for the mem* functions,
- makes sure all implementations are unit-tested (provided the host can run them).
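
To illustrate the last point, here is a minimal sketch of how a test could run every implementation the host supports and skip the rest. It is not the code from this patch: `Implementation`, `bytewise_memcpy`, `host_has_avx2` and `check_supported_implementations` are hypothetical names, and the "avx2" entry reuses the generic routine as a stand-in for a function that would really be built with `-mavx2`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical names throughout; this is not the API introduced by the patch.
using MemcpyFn = void *(*)(void *, const void *, std::size_t);

struct Implementation {
  const char *name;
  MemcpyFn fn;
  bool (*host_can_run)(); // true if the host CPU can execute `fn`
};

// Portable reference implementation, always runnable.
static void *bytewise_memcpy(void *dst, const void *src, std::size_t n) {
  auto *d = static_cast<unsigned char *>(dst);
  const auto *s = static_cast<const unsigned char *>(src);
  for (std::size_t i = 0; i < n; ++i)
    d[i] = s[i];
  return dst;
}

static bool always() { return true; }

// Runtime feature check; GCC/Clang builtin on x86, conservatively false elsewhere.
static bool host_has_avx2() {
#if (defined(__x86_64__) || defined(__i386__)) && \
    (defined(__GNUC__) || defined(__clang__))
  __builtin_cpu_init();
  return __builtin_cpu_supports("avx2") != 0;
#else
  return false;
#endif
}

// Checks every implementation the host can run and returns how many were checked.
static int check_supported_implementations() {
  const Implementation impls[] = {
      {"generic", bytewise_memcpy, always},
      // Stand-in: a real build would list a function compiled with -mavx2.
      {"avx2 (stand-in)", bytewise_memcpy, host_has_avx2},
  };
  const char src[] = "mem* functions";
  int checked = 0;
  for (const auto &impl : impls) {
    if (!impl.host_can_run())
      continue; // skip implementations this host cannot execute
    char dst[sizeof(src)] = {};
    impl.fn(dst, src, sizeof(src));
    assert(std::memcmp(dst, src, sizeof(src)) == 0);
    ++checked;
  }
  return checked;
}
```

The generic entry is always checked, so the function returns at least 1 on any host. Implementations whose features are missing are skipped rather than reported as failures.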
 
I'd really use -mcpu here though, as it will also enable available features.