Scalar FMUL, FMULX instructions perform better or the same compared to indexed
FMUL, FMULX.
For example, the Arm Cortex-A55 Software Optimization Guide lists the following
instructions with a throughput of 2 IPC:
- "FP multiply" FMUL
- "ASIMD FP multiply" FMULX
whereas it lists the following with a throughput of 1 IPC:
- "ASIMD FP multiply, by element" FMUL, FMULX
The Arm Cortex-A510 Software Optimization Guide, however, does not separately
list "by element" variants of the "ASIMD FP multiply" instructions, which are
listed with the same throughput as the non-ASIMD ones.
Fixes #60817.
I think these should be HasNEON too.