When the "tune-cpu" attribute is not set or is set to "x86-64", we currently use an approximation sequence (when permitted) rather than the sqrtss instruction. Since this instruction is available with the default x86-64 ISA and is more accurate, it is better to assume fast sqrt by default, as we do with "tune-cpu"="generic".
The clang front end sets "tune-cpu"="generic" if no tuning or target processor is specifically requested, but other front ends that set "target-cpu"="x86-64" will get the "x86-64" tuning, which is different from "generic".
I've also started a discussion of this here: https://discourse.llvm.org/t/fast-scalar-fsqrt-tuning-in-x86/63605
Add an explicit triple for cases when the test is run on other architectures.
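
For illustration, a reduced test for the case described above might look roughly like the sketch below. This is not taken from the patch: the function name and the choice of `fast` on the call are assumptions, and the `-mtriple` value is just one common way to pin the target so the test behaves the same when run on a non-x86 host.

```llvm
; Hypothetical reduced test; names and flags are illustrative assumptions.
; The explicit -mtriple keeps the target fixed when the host is not x86.
; RUN: llc < %s -mtriple=x86_64-unknown-unknown | FileCheck %s

; Only the label is checked here. Per the description above, the body is
; expected to use sqrtss rather than an estimate sequence after this change,
; but the exact assembly is not reproduced in this sketch.
; CHECK-LABEL: sqrt_x86_64_tuning:
define float @sqrt_x86_64_tuning(float %x) #0 {
  ; Fast-math flags are what permit an approximation in the first place.
  %r = call fast float @llvm.sqrt.f32(float %x)
  ret float %r
}

declare float @llvm.sqrt.f32(float)

; Mimics a front end that requests "x86-64" for both target and tuning,
; as opposed to clang's default "tune-cpu"="generic".
attributes #0 = { "target-cpu"="x86-64" "tune-cpu"="x86-64" }
```

Changing `"tune-cpu"` to `"generic"` in the attribute group gives the clang-default configuration for comparison.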