FeatureSplatVFPToNeon is enabled for Cortex-A15 and Exynos. The software optimisation guides for Cortex-A57 [1] and Cortex-A72 [2] suggest it would also be beneficial there. I put together a patch that took the obvious route of adding FeatureSplatVFPToNeon to the features for those cores. However, while looking at the root cause of the code generation that makes the pass necessary, I came to a less intrusive solution that looks more suitable for generic compilation.
The pass attempts to remove instances where a VFP register is written as an S register and then read as a D register (see the section Register Forwarding Hazards in the linked software optimisation guides).
By far the biggest contributor to instances of "write S, read D" is an optimisation that runs VMAX through Neon, for example:
    define float @max_f32(float, float) {
      %3 = call nnan float @llvm.maxnum.f32(float %1, float %0)
      ret float %3
    }

    max_f32:
      vmov.f32 s2, s1
      vmax.f32 d0, d1, d0
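To make the hazard in the sequence above concrete, here is a minimal sketch of a checker for "write S, read D" pairs. It is only an illustration, not how the SplatVFPToNeon pass works: the `Inst` record, `dIndex`, and `hasWriteSReadD` are hypothetical names, and the only architectural fact it relies on is that S registers alias halves of D registers (s0/s1 overlap d0, s2/s3 overlap d1, and so on).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified instruction record: one written register
// and the registers it reads. Not an LLVM data structure.
struct Inst {
    std::string def;               // e.g. "s2" or "d0"
    std::vector<std::string> uses; // e.g. {"d1", "d0"}
};

// Index of the D register that a register name overlaps:
// dN -> N, sN -> N / 2. Returns -1 for anything else.
int dIndex(const std::string &reg) {
    if (reg.size() < 2) return -1;
    int n = std::stoi(reg.substr(1));
    if (reg[0] == 'd') return n;
    if (reg[0] == 's') return n / 2;
    return -1;
}

// True if some instruction writes an S register and a later instruction
// reads the overlapping D register before that D register is fully rewritten.
bool hasWriteSReadD(const std::vector<Inst> &insts) {
    for (std::size_t i = 0; i < insts.size(); ++i) {
        if (insts[i].def.empty() || insts[i].def[0] != 's') continue;
        int written = dIndex(insts[i].def);
        for (std::size_t j = i + 1; j < insts.size(); ++j) {
            // Reads are checked before the write of the same instruction.
            for (const auto &u : insts[j].uses)
                if (!u.empty() && u[0] == 'd' && dIndex(u) == written)
                    return true;
            // In this toy model a full D write ends the hazard window.
            const std::string &d = insts[j].def;
            if (!d.empty() && d[0] == 'd' && dIndex(d) == written) break;
        }
    }
    return false;
}
```

Fed the two instructions from the listing (`vmov.f32 s2, s1` then `vmax.f32 d0, d1, d0`), the checker flags the pair because s2 is the low half of d1. A purely scalar sequence that stays in S registers is not flagged.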
Rather than propose FeatureSplatVFPToNeon for generic to work around this codegen, I'd instead like to ask whether we should just avoid this codegen in the first place, by disabling the lowering to Neon unless we're under UseNEONForSinglePrecisionFP.
Note that this patch is only applicable to Armv7-A 32-bit targets; when Armv8-A is enabled, the single-precision VMAXNM instruction can be used.
This patch implements that for 32-bit floats, but leaves 16-bit floats alone - they are only available from Armv8.2-A onwards, which none of Cortex-A15, Cortex-A57 or Cortex-A72 implement.
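The intended decision for f32 can be summarised with a small sketch. This is a toy model under stated assumptions, not the code in the patch: `ToySubtarget` is a hypothetical stand-in rather than LLVM's real subtarget class, its fields merely mirror the properties discussed above, and `DefaultLegalization` is a placeholder for whatever the backend does when neither special case applies.

```cpp
// Hypothetical stand-in for the subtarget queries discussed above;
// not LLVM's real ARMSubtarget interface.
struct ToySubtarget {
    bool hasFPARMv8;                  // Armv8-A FP: scalar VMAXNM exists
    bool hasNEON;                     // Neon unit present
    bool useNEONForSinglePrecisionFP; // Neon preferred for f32 arithmetic
};

enum class F32MaxLowering {
    ScalarVMAXNM,        // single-precision VMAXNM on an S register
    NeonVMAX,            // vmax.f32 on D registers (the "write S, read D" source)
    DefaultLegalization  // placeholder: leave it to the generic path
};

// Proposed behaviour: only route f32 max through Neon when the
// subtarget already prefers Neon for single-precision FP.
F32MaxLowering chooseF32MaxLowering(const ToySubtarget &st) {
    if (st.hasFPARMv8)
        return F32MaxLowering::ScalarVMAXNM;
    if (st.hasNEON && st.useNEONForSinglePrecisionFP)
        return F32MaxLowering::NeonVMAX;
    return F32MaxLowering::DefaultLegalization;
}
```

Under this model an Armv7-A target with Neon but without UseNEONForSinglePrecisionFP no longer takes the Neon path, which is the case that previously produced the hazard.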
I've validated that this gives performance improvements on Cortex-A57 and Cortex-A72 similar to those you get by turning on FeatureSplatVFPToNeon, and also validated this change against Cortex-A53, where I saw no performance difference. Across a larger range of benchmarks, performance came out even on Cortex-A76, with one >5% regression.
If this looks OK, I'd appreciate someone applying it for me, as I have no commit rights.
[1]: http://infocenter.arm.com/help/topic/com.arm.doc.uan0015b/Cortex_A57_Software_Optimization_Guide_external.pdf
[2]: https://static.docs.arm.com/uan0016/a/cortex_a72_software_optimization_guide_external.pdf
If we do this for f32, we should presumably do the same for f16.