This patch extends the cost model for FMUL operations to account for
scalar FMA opportunities. If a scalar FMUL in the bundle has a single
FADD/FSUB user, it is very likely those instructions can be fused into a
single fmuladd/fmulsub operation, which makes the multiply effectively
free.
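The fusion check described above can be sketched as a standalone predicate. This is a hypothetical model, not the patch's actual code: `Op`, `Opcode`, and `isFusableFMul` are invented names, and real LLVM code would inspect `Instruction` users instead.

```cpp
#include <vector>

// Minimal stand-in for an IR operation: an opcode plus its users.
enum class Opcode { FMul, FAdd, FSub, Other };

struct Op {
  Opcode Kind;
  std::vector<const Op *> Users;
};

// An FMUL is considered fusable when it has exactly one user and that
// user is an FADD or FSUB, so the pair can likely become one fmuladd.
bool isFusableFMul(const Op &O) {
  if (O.Kind != Opcode::FMul || O.Users.size() != 1)
    return false;
  Opcode U = O.Users.front()->Kind;
  return U == Opcode::FAdd || U == Opcode::FSub;
}
```

The single-user requirement matters: if the FMUL result is also consumed elsewhere, the multiply must be materialized anyway, so it cannot be treated as free.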
The patch counts the number of fusable operations in a bundle. If all
entries in the bundle can be fused, it is likely that the resulting
vector instructions can also be fused. In this case, consider both
versions free and return the common cost, with a tiny bias towards
vectorizing.
Otherwise, only the fusable scalar FMULs are treated as free.
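The bundle-level cost adjustment can be sketched as follows. This is a minimal model under stated assumptions: `adjustForFMA`, the cost fields, and the exact bias value are hypothetical, not the patch's real interface.

```cpp
#include <vector>

// Scalar vs. vector cost of the FMULs in one SLP bundle (model only).
struct FMulCosts {
  int ScalarCost; // summed cost of the scalar FMULs
  int VectorCost; // cost of the vectorized FMUL
};

// Fusable[i] is true when scalar lane i has a single FADD/FSUB user.
// PerLaneCost is the assumed cost of one scalar FMUL.
FMulCosts adjustForFMA(FMulCosts C, const std::vector<bool> &Fusable,
                       int PerLaneCost) {
  int NumFusable = 0;
  for (bool F : Fusable)
    NumFusable += F;
  if (NumFusable == (int)Fusable.size()) {
    // All lanes fuse, so the vector FMUL will likely fuse as well.
    // Treat both versions as free, with a tiny bias towards
    // vectorizing (here modeled as scalar cost 1 vs. vector cost 0).
    C.ScalarCost = 1;
    C.VectorCost = 0;
  } else {
    // Only the fusable scalar FMULs are free; the vector cost stands.
    C.ScalarCost -= NumFusable * PerLaneCost;
  }
  return C;
}
```

For example, with a 4-wide bundle where only two lanes fuse, the scalar side keeps the cost of the two non-fusable multiplies while the vector cost is unchanged, so vectorization is no longer artificially favored.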
This fixes a regression in an application very sensitive to SLP changes
after 65c7cecb13f8d2132a54103903501474083afe8f, and overall improves its
performance by 10%. There is no other measurable impact on the other
applications in a large proprietary benchmark suite on ARM64.
Excessive SLP vectorization in the presence of scalar FMA opportunities
has also been discussed in D131028 and mentioned by @dmgreen.
D125987 also tries to address a similar issue, but with a focus on
horizontal reductions.
I'd appreciate it if someone could give this a test on the X86 side.
The cost-based analysis may lead to the wrong final decision here; it may be necessary to return something like a flag instead, or to implement this analysis in TTI. What if the cost of Intrinsic::fmuladd != TCC_Basic?