This is an archive of the discontinued LLVM Phabricator instance.

[Arm] Do not lower vmax/vmin to Neon instructions
ClosedPublic

Authored by jgreenhalgh on Feb 27 2020, 5:53 AM.

Details

Summary

FeatureSplatVFPToNeon is on for Cortex-A15 and Exynos. The documentation for Cortex-A57 [1] and Cortex-A72 [2] suggests it would also be beneficial there. I put together a patch taking the obvious route towards adding FeatureSplatVFPToNeon to the features for those cores, but in looking at the root-cause of code generation that requires that pass to be enabled, came to a less intrusive solution, that looks more suitable for generic compilation.

The pass attempts to remove instances where a VFP register is written an an S register, and read as a D register (see the section Register Forwarding Hazards in the linked software optimisation guides).

By far the biggest contributor to instances of "write S, read D" is an optimisation applied to run VMAX through Neon; as so:

define float @max_f32(float, float) {
    %3 = call nnan float @llvm.maxnum.f32(float %1, float %0)
    ret float %3
}

max_f32:
    vmov.f32        s2, s1
    vmax.f32        d0, d1, d0

Rather than propose FeatureSplatVFPToNeon for generic to work around this codegen, I'd instead like to ask whether we should just avoid this codegen in the first place, by disabling the lowering to Neon unless we're under UseNEONForSinglePrecisionFP.

Note that this patch is only applicable to Armv7-A 32-bit targets; when Armv-8-A is enabled, the single precision VMAXNM instruction can be used.

This patch implements that for 32-bit floats, but leaves 16-bit floats alone - they exist after Armv8.2-A, which none of Cortex-A15, Cortex-A57 or Cortex-A72 implement.

I've validated that this gives performance improvements on Cortex-A57 and Cortex-A72 similar to that you get by turning on FeatureSplatVFPToNeon, and also validated this change against Cortex-A53, where I saw no performance difference. Across a larger range of benchmarks performance came out even on Cortex-A76, with one >5% regression.

If this looks Ok, I'd appreciate someone applying it for me, as I have no commit rights.


[1]: http://infocenter.arm.com/help/topic/com.arm.doc.uan0015b/Cortex_A57_Software_Optimization_Guide_external.pdf
[2]: https://static.docs.arm.com/uan0016/a/cortex_a72_software_optimization_guide_external.pdf

Diff Detail

Event Timeline

jgreenhalgh created this revision.Feb 27 2020, 5:53 AM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 27 2020, 5:53 AM

I think this is probably OK, I'm not against it, but you could easily argue it either way. If we have an option called useNEONForSinglePrecisionFP, we should probably stick to it. Even if it's not universally applicable. So long as not-one else objects.

We should keep it consistent for FP16 as well though.

llvm/lib/Target/ARM/ARMISelLowering.cpp
1422–1423

If we do this for f32, we should presumably do the same for f16.

Is there also a correctness issue here? I just briefly looked at the ARM manual, and I think vmax flushes denormals.

Update to also cover float16 as per David's request.

Thanks. LGTM, as as you said I will commit this shortly.

dmgreen accepted this revision.Mar 10 2020, 3:50 AM
This revision is now accepted and ready to land.Mar 10 2020, 3:50 AM
This revision was automatically updated to reflect the committed changes.