When we generate a horizontal reduction of floating-point adds fed by a vectorized tree rooted at floating-point multiplies, we should account for the cost of no longer being able to generate scalar FMAs. Similarly, if we vectorize a list of floating-point multiplies, each of which feeds a single floating-point add, we should again account for this cost.
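For illustration only, here is a minimal source-level sketch of the kind of pattern in question. It is not taken from the tests in this patch; the function names `dot4` and `dot4_fma` are hypothetical. The first function is a sum of four products, which a vectorizer could turn into a vector multiply plus a horizontal add reduction. The second spells out the scalar alternative that becomes unavailable once the multiplies are vectorized: one multiply followed by three FMAs. Whether a backend actually forms those FMAs depends on the target and on contraction being permitted.

```cpp
#include <cmath>

// Hypothetical reduced pattern: four floating-point multiplies feeding a
// chain of floating-point adds. A vectorizer may rewrite this as one vector
// multiply followed by a horizontal add reduction.
float dot4(const float *a, const float *b) {
  return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}

// The scalar lowering that vectorizing the multiplies rules out: one
// multiply plus three fused multiply-adds (assuming the target has FMA and
// contraction is allowed).
float dot4_fma(const float *a, const float *b) {
  float acc = a[0] * b[0];
  acc = std::fma(a[1], b[1], acc);
  acc = std::fma(a[2], b[2], acc);
  acc = std::fma(a[3], b[3], acc);
  return acc;
}
```

The scalar form needs four arithmetic instructions in total, so a vector tree whose cost looks only slightly negative can still lose to it once the multiply, the reduction shuffles, and the final extract are counted.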
The first test was reduced from a case where the vectorizable tree, combined with a horizontal reduction, looked barely profitable (cost -1) but produced substantially worse code than allowing the scalar FMAs to be generated. The second test is derived from the first: we again generate a horizontal reduction, and even if that reduction is forced to be unprofitable, we still try to vectorize the multiplies. I have two follow-up patches to address these issues.
Remove #0