Register forwarding hazards might occur when one uop reads a D- or
Q-register operand that has recently been written with one or more
S-register results. This happens only in AArch32 state on Cortex-A57,
Cortex-A72, and Cortex-A77 (and probably on other processors as well).
See section 4.4, "Register Forwarding Hazards", of the Cortex-A72
Software Optimization Guide for more details.
The pass replaces S-register writes with the corresponding scalar
writes to D-registers. If there is no suitable replacement, an
S-register is copied to a D-register scalar via a core register.
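
The replacement relies on the AArch32 register aliasing, in which S(2n)
and S(2n+1) overlay the low and high 32-bit halves of D(n), and D(2n)
and D(2n+1) overlay Q(n). A minimal sketch of that mapping follows, in
Python for illustration only; the helper names are hypothetical and are
not part of the pass.

```python
def s_to_d_lane(s_index: int) -> tuple[int, int]:
    """Return (D-register index, 32-bit lane) aliased by S<s_index>.

    Hypothetical helper: S(2n) is lane 0 of D(n), S(2n+1) is lane 1.
    """
    if not 0 <= s_index <= 31:
        raise ValueError("AArch32 defines S0..S31 only")
    return s_index // 2, s_index % 2


def d_to_q(d_index: int) -> int:
    """Return the Q-register index that contains D<d_index>."""
    if not 0 <= d_index <= 31:
        raise ValueError("AArch32 defines D0..D31 only")
    return d_index // 2


# Example: a write to S3 corresponds to a scalar write to D1[1],
# and D1 is the upper half of Q0.
d, lane = s_to_d_lane(3)
q = d_to_q(d)
```

Under this mapping, a write to S3 followed by a read of D1 or Q0 is the
kind of access pattern the hazard description above refers to.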
The pass is disabled by default and can be enabled with the
-arm-subreg-write LLVM option when a non-zero optimization level is set.
With this optimization, llvm-test-suite/MultiSource/Benchmarks/Bullet
shows a ~10% performance improvement on Cortex-A72.
The pass has also been tested on the Skia library. Skia's nanobench shows
~1.3% geomean improvement and ~10% improvement for some subtests on