If we have a vector FP division with a splatted divisor, we can use the existing transform that converts 'x/y' into 'x * (1.0/y)' to allow more conversions. The reciprocal of a splat is itself a splat, so the '1.0/y' can then potentially be converted into a scalar FP division by existing combines (rL358984), as seen in the tests here.
That can make a big perf difference if scalar fdiv has better timing than vector fdiv (including avoiding possible frequency throttling for vector ops).
There's another diff here in the ordering of the transforms: I'm proposing to move the repeated divisor transform ahead of the reciprocal estimate transform because that seems more likely to produce the best results. For default x86, we don't turn fdiv f32 into an estimate because the estimate accuracy is too poor for most code. Not using the estimate is probably the right perf choice for current and future CPUs since divss throughput is down to the 3-4 cycle range (Skylake/Ryzen).