On FMA targets, we can avoid loading a constant to negate a float/double multiply by using FNMSUB with a zero operand instead: -(X*Y)-0.
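As a rough illustration of why the two forms agree, here is a minimal scalar sketch in C. It is not the patch itself. The helper names (`fnmsub_model`, `neg_mul_xor`, `neg_mul_fnmsub`, `same_bits`, `check_pair`) are hypothetical. The model assumes the default round-to-nearest mode and does not use a real fused instruction. Because subtracting zero is exact, the unfused model gives the same result as a fused -(X*Y)-0 would.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar model of FNMSUB: -(x*y) - z.
   Hardware rounds once; this model rounds the product first.
   With z == 0.0 the subtraction is exact, so the results agree. */
static double fnmsub_model(double x, double y, double z) {
    return -(x * y) - z;
}

/* Negated multiply via the constant-based form: multiply, then flip
   the sign bit by XOR-ing with a sign-mask constant. */
static double neg_mul_xor(double x, double y) {
    double p = x * y;
    uint64_t bits;
    memcpy(&bits, &p, sizeof bits);
    bits ^= UINT64_C(0x8000000000000000); /* the constant that must be loaded */
    memcpy(&p, &bits, sizeof p);
    return p;
}

/* Negated multiply via FNMSUB with a zero operand: no mask constant,
   but a zero register is needed. */
static double neg_mul_fnmsub(double x, double y) {
    return fnmsub_model(x, y, 0.0);
}

/* Bitwise comparison, so that +0.0 and -0.0 are told apart. */
static int same_bits(double a, double b) {
    return memcmp(&a, &b, sizeof a) == 0;
}

/* Asserts that both forms produce bit-identical results for one input pair. */
static void check_pair(double x, double y) {
    assert(same_bits(neg_mul_fnmsub(x, y), neg_mul_xor(x, y)));
}
```

Calling `check_pair(0.0, 1.0)` and `check_pair(0.0, -1.0)` exercises the signed-zero cases, which are the ones most likely to differ. Under round-to-nearest both forms yield -0.0 and +0.0 respectively. Other rounding modes and NaN inputs are outside what this sketch covers.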
Note: As Sanjay mentioned in his bug report, although this is consistently faster (because it avoids the constant load), it does increase register pressure by requiring us to create a zero register. I'm not sure how best to qualify this if people think it's a problem. Restricting it to optsize doesn't really help us: we MAY reduce constant pool size (if no other FNEGs are present), but we MAY also increase code size by having to handle extra stack traffic. We do have precedent for this: we use blendps to zero out elements instead of the slower insertps, and I'm sure there are plenty of other examples.
Fix for PR24366