As discussed in D118534, all recent AMD CPUs have a relatively fast (<14 cycle latency) "sqrtss" instruction:
https://uops.info/table.html?search=sqrtps&cb_lat=on&cb_tp=on&cb_SNB=on&cb_SKL=on&cb_ZENp=on&cb_ZEN2=on&cb_ZEN3=on&cb_measurements=on&cb_avx=on&cb_sse=on
So we should set this tuning flag so that a plain "sqrt(X)" is lowered to the hardware instruction instead of the estimate-based expansion (reciprocal-sqrt is a separate pattern with its own test coverage). The expansion is both slower and less accurate than the hardware instruction.
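To illustrate the accuracy gap, here is a rough sketch in C with SSE intrinsics (x86 only). It assumes the expansion has the usual shape of an "rsqrtss" estimate, one Newton-Raphson refinement step, and a multiply by X; the exact sequence LLVM emits may differ in detail. The helper names sqrt_hw, sqrt_estimate, and rel_error are hypothetical and exist only for this example.

```c
#include <xmmintrin.h> /* SSE intrinsics; x86 only */

/* Hardware path: a single sqrtss, which is correctly rounded. */
static float sqrt_hw(float x) {
    return _mm_cvtss_f32(_mm_sqrt_ss(_mm_set_ss(x)));
}

/* Estimate path (assumed shape of the expansion):
 * rsqrtss estimate, one Newton-Raphson step, then multiply by x. */
static float sqrt_estimate(float x) {
    if (x == 0.0f)
        return x; /* avoid 0 * inf = NaN from the estimate */
    float e = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
    e = e * (1.5f - 0.5f * x * e * e); /* refine e towards 1/sqrt(x) */
    return x * e;                      /* sqrt(x) = x * (1/sqrt(x)) */
}

/* Relative error of approx, using the hardware result as the reference. */
static double rel_error(float approx, float x) {
    double ref = (double)sqrt_hw(x);
    double diff = (double)approx - ref;
    if (diff < 0.0)
        diff = -diff;
    return diff / ref;
}
```

Calling rel_error(sqrt_estimate(x), x) over a range of inputs should show small but nonzero errors for the estimate path, while sqrt_hw gets the same answer from a single instruction.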
Please also add TuningFastVectorFSQRT; I don't see any difference between the scalar and vector variants, at least on znver3.