fma reassoc A, B, C --> fadd (fmul A, B), C (when target has no FMA hardware)
C/C++ code may use explicit fma() calls (which become LLVM fma intrinsics in IR) and then be compiled with -ffast-math or similar.
For targets that do not have FMA hardware, we don't want to go out to the math library for a precise but slow FMA result.
I tried this as a generic DAGCombine, but it caused infinite looping on more than one other target, so there's likely some over-reaching FMA formation happening there.
There's also a potential intersection of strict FP with fast-math here. I'm not sure who should win that fight, so I'm just deferring to the current behavior for that case.
These Piledriver checks disappeared. Was that deliberate?