Not folding these cases tends to avoid partial register updates:
sqrtss (%eax), %xmm0
Has a partial update of %xmm0, while
movss (%eax), %xmm0 sqrtss %xmm0, %xmm0
Has a clobber of the high lanes immediately before the partial update, avoiding a potential stall.
Given this, we only want to fold when optimizing for size.
This is consistent with the patterns we already have for the fp/int converts, and in X86InstrInfo::foldMemoryOperandImpl()