X86 memcpy: use REPMOVSB instead of REPMOVS{Q,D,W} for inline copies when the subtarget has fast strings.
This has two advantages:
- Speed is improved. For example, on Haswell the throughput gain grows linearly with copy size from 256 to 512 bytes, after which it plateaus (e.g. +1% for 260 bytes, +25% for 400 bytes, +40% for 508 bytes and larger).
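- Code is much smaller: a single REPMOVSB covers every byte, so there is no need to handle the bytes left over when the size is not a multiple of the REPMOVS{Q,D,W} element width.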
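To make the code-size point concrete, here is a rough model of what each strategy has to emit for a constant-size copy. It is only a sketch: `CopyPlan`, `planRepMovsq` and `planRepMovsb` are hypothetical names, not LLVM APIs, and the 4/2/1-byte tail split is an assumed lowering for the leftover bytes rather than the exact sequence the backend produces.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical model of the pieces an inline copy of a constant size needs.
struct CopyPlan {
  std::uint64_t RepCount;           // iteration count loaded for the REP prefix
  unsigned UnitBytes;               // 8 for MOVSQ, 1 for MOVSB
  std::vector<unsigned> TailMoves;  // widths of extra moves for leftover bytes
};

// Wide strategy: REP MOVSQ copies Size / 8 quadwords, and the Size % 8
// leftover bytes need separate moves (assumed here to be 4, 2 and 1 bytes).
CopyPlan planRepMovsq(std::uint64_t Size) {
  CopyPlan Plan{Size / 8, 8, {}};
  std::uint64_t Left = Size % 8;
  for (unsigned Width : {4u, 2u, 1u}) {
    if (Left >= Width) {
      Plan.TailMoves.push_back(Width);
      Left -= Width;
    }
  }
  return Plan;
}

// Byte strategy: REP MOVSB copies exactly Size bytes, so there is no tail.
CopyPlan planRepMovsb(std::uint64_t Size) {
  return CopyPlan{Size, 1, {}};
}
```

For a 263-byte copy, the quadword plan in this model needs 32 iterations plus three tail moves, while the byte plan is a single REP MOVSB with a count of 263.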
Do we want the feature to be as simple as 'fast/slow' or should we take the size of the copy into account as well?
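The two options could look roughly like this. This is a sketch for discussion only: the function names and the `MinProfitableSize` parameter are hypothetical, and any concrete threshold (for instance 256, read off the Haswell numbers above) would be an assumption that needs measuring per subtarget.

```cpp
#include <cstdint>

// Option A: a plain 'fast/slow' subtarget flag decides on its own.
bool useRepMovsbSimple(bool HasFastStrings) {
  return HasFastStrings;
}

// Option B: the flag is necessary but the copy must also be large enough.
// MinProfitableSize is a hypothetical per-subtarget tuning value.
bool useRepMovsbSized(bool HasFastStrings, std::uint64_t Size,
                      std::uint64_t MinProfitableSize) {
  return HasFastStrings && Size >= MinProfitableSize;
}
```

With an illustrative threshold of 256, option B would keep the existing lowering for a 128-byte copy and switch to REPMOVSB for a 400-byte one, whereas option A would switch for both.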