The scalar variant with GPR source/dest has considerably higher latency
than the SIMD&FP scalar variant across a variety of micro-architectures:

Core           Scalar   SIMD&FP
--------------------------------
Neoverse V1    9 cyc    3 cyc
Neoverse N2    8 cyc    3 cyc
Cortex A510    8 cyc    4 cyc

Maybe put a bit of the explanation you just gave into a comment here, for reference.