This patch evaluates fmul+fadd -> fmadd combines and similar code sequences in
the machine combiner. It adds support for float and double, modeled on the
existing integer implementation. The key features of this patch are:
- DAGCombiner checks whether it should combine greedily or let the machine combiner do the evaluation. This is (initially) only supported for ARM64.
- It prefers throughput over latency: the heuristic is to always combine in loops, but the target can choose a different policy. For in-order cores, favoring latency over throughput might be the better choice.
- Support for fmadd, f(n)msub, fmla, fmls
- On by default at O3 with fast-math
- Performance: (mostly) single-digit gains on kernels and SPEC2006 fp
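
To make the throughput-versus-latency trade-off in the list above concrete, here is a minimal, self-contained sketch. It is not code from the patch: the names `Latencies`, `depthSeparate`, `depthFused` and `shouldCombine`, and the cycle counts used with them, are hypothetical. It shows the source-level shape of the combine and a toy dependency-depth comparison, assuming the fused instruction cannot issue until all three operands are ready.

```cpp
#include <algorithm>
#include <cmath>

// Source-level shape of the combine: a separate multiply and add ...
double mulThenAdd(double a, double b, double c) { return a * b + c; }

// ... and the fused form (fmadd), written here with the standard std::fma.
double fusedMulAdd(double a, double b, double c) { return std::fma(a, b, c); }

// Hypothetical instruction latencies in cycles; not taken from any real core.
struct Latencies {
  int fmul;
  int fadd;
  int fmadd;
};

// Cycle at which the result is ready when fmul and fadd stay separate.
// readyAB: cycle when both multiplicands are available.
// readyC:  cycle when the addend is available.
int depthSeparate(int readyAB, int readyC, const Latencies &lat) {
  return std::max(readyAB + lat.fmul, readyC) + lat.fadd;
}

// Cycle at which the result is ready when the pair is fused: the fused
// instruction cannot start until all three inputs are available.
int depthFused(int readyAB, int readyC, const Latencies &lat) {
  return std::max(readyAB, readyC) + lat.fmadd;
}

// Toy decision rule (an assumption, not the patch's code): always combine in
// loops to favor throughput; elsewhere combine only if doing so does not
// lengthen the dependency chain.
bool shouldCombine(bool inLoop, int readyAB, int readyC, const Latencies &lat) {
  if (inLoop)
    return true;
  return depthFused(readyAB, readyC, lat) <= depthSeparate(readyAB, readyC, lat);
}
```

With assumed latencies of 4 (fmul), 4 (fadd) and 6 (fmadd), fusing wins when all inputs are ready at cycle 0 (6 versus 8 cycles). If the addend only arrives at cycle 10, the separate sequence finishes at cycle 14 and the fused one at cycle 16, so outside a loop this model keeps the instructions separate.
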
For the future: the pattern list is starting to grow quite large. I wonder whether we should consider making the MachineCombinerPatterns table-generated?