I had originally made this a FIXME in D7866, but we're attacking the problem from different angles now. If we don't have a target-specific combine on insertps, we need to generate the right code in the first place.
With this patch, for this one exact case, we'll generate:
  blendps $1, %xmm0, %xmm1
instead of:
  insertps $0, %xmm0, %xmm1
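To illustrate why these two encodings are interchangeable for this one case, here is a minimal scalar model of both instructions with a register source. It is only a sketch: the function names and the list-of-four-floats representation are assumptions made for illustration, and the immediate layouts follow my reading of the Intel instruction reference (insertps: bits 7:6 select the source lane, bits 5:4 the destination lane, bits 3:0 a zero mask; blendps: bit i selects lane i from the source).

```python
def insertps(dst, src, imm8):
    """Scalar model of INSERTPS with a register source (illustrative only)."""
    count_s = (imm8 >> 6) & 0x3   # source lane to read
    count_d = (imm8 >> 4) & 0x3   # destination lane to overwrite
    zmask = imm8 & 0xF            # lanes forced to zero after the insert
    out = list(dst)
    out[count_d] = src[count_s]
    return [0.0 if (zmask >> i) & 1 else out[i] for i in range(4)]


def blendps(dst, src, imm8):
    """Scalar model of BLENDPS: bit i of imm8 set means take lane i from src."""
    return [src[i] if (imm8 >> i) & 1 else dst[i] for i in range(4)]


dst = [10.0, 11.0, 12.0, 13.0]
src = [20.0, 21.0, 22.0, 23.0]

# Lane 0 of src into lane 0 of dst, nothing zeroed: both forms agree.
assert insertps(dst, src, 0x00) == blendps(dst, src, 0x01)
```

Any insertps immediate that reads a source lane other than its destination lane, or that zeroes lanes, has no single blendps equivalent, which is consistent with the patch covering only this exact case.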
If there's a memory operand available for load folding and we're optimizing for size, we'll still generate the insertps.
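The rule above can be sketched as a small predicate. This is not the actual lowering code from the patch; the function and parameter names are hypothetical. The likely reason the memory case matters is that insertps takes a 32-bit memory operand while blendps takes a 128-bit one, so a scalar load can fold into insertps but not into blendps.

```python
def pick_instruction(has_foldable_load: bool, opt_for_size: bool) -> str:
    """Hypothetical model of the selection rule described in this patch."""
    # Keep insertps only when a load can be folded into it AND we are
    # optimizing for size, since that saves a separate load instruction.
    if has_foldable_load and opt_for_size:
        return "insertps"
    # Otherwise prefer the higher-throughput blend.
    return "blendps"


for load in (False, True):
    for size in (False, True):
        print(load, size, pick_instruction(load, size))
```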
The detailed performance data motivating this change is in D7866; in summary, blendps has 2-3x the throughput of insertps on widely used chips.