This is a reimplementation of D9780 at the machine instruction level rather than the DAG.
I'm using the MachineCombiner pass to reassociate scalar single-precision AVX additions (just a starting point; see the TODO comment) to increase ILP when it's safe to do so.
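To illustrate the kind of reassociation meant here, a minimal source-level sketch follows. The function names are hypothetical and this is not code from the patch. It only shows why regrouping a chain of additions shortens the dependency chain, and why that is only legal under unsafe FP math.

```cpp
// Serial chain: each add waits on the previous result, so the
// critical path is three dependent additions.
float sum_serial(float a, float b, float c, float d) {
  return ((a + b) + c) + d;
}

// Reassociated form: (a + b) and (c + d) do not depend on each other,
// so they can execute in parallel; the critical path is two additions.
// Floating-point addition is not associative, so the result can differ
// in rounding from sum_serial. That is why the transform is only valid
// when unsafe FP math is enabled.
float sum_reassociated(float a, float b, float c, float d) {
  return (a + b) + (c + d);
}
```

For inputs where every intermediate sum is exactly representable, both forms return the same value; in general they may differ in the last bits.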
The code is closely based on the existing MachineCombiner optimization that is implemented for AArch64.
I tried the test cases that Mehdi provided in the follow-up thread for r236031, and I don't see any instruction count increase. In the massive test case (~8000 machine instructions) that blew up previously, this optimization fires ~2000 times. This causes a severe compile-time increase (roughly 20x):
$ time llc -enable-unsafe-fp-math -x86-machine-combiner=0 -mattr=avx spill.ll
1.8 sec
$ time llc -enable-unsafe-fp-math -x86-machine-combiner=1 -mattr=avx spill.ll
35.8 sec
For now, I'm assuming that this is a degenerate test case (for x86 at least) that we don't need to artificially limit, but we could also clip the optimization to only fire N times.
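If clipping turns out to be needed, one way to do it is a simple budget counter. The sketch below is standalone and purely illustrative: `Candidate` and `combineWithBudget` are hypothetical names, not LLVM types or functions, and a real limit would have to live inside the combiner itself.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one reassociation candidate; not an LLVM type.
struct Candidate {
  bool profitable;
};

// Applies at most maxFires profitable candidates and returns how many fired.
std::size_t combineWithBudget(const std::vector<Candidate> &candidates,
                              std::size_t maxFires) {
  std::size_t fired = 0;
  for (const Candidate &c : candidates) {
    if (fired == maxFires)
      break; // budget exhausted: leave the remaining candidates untouched
    if (c.profitable)
      ++fired; // a real pass would rewrite the instruction sequence here
  }
  return fired;
}
```

With a budget of N, compile time spent in the optimization is bounded regardless of input size, at the cost of leaving some profitable candidates unoptimized in degenerate cases.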