This patch adds a target-specific combine to transform insertps nodes into blendi nodes. We just have to check to see if a translation of the immediate mask is possible.
Insertps has less potential throughput than blendps on all x86 chips that I have surveyed. For example on Haswell, we can execute blendps on 3 different ports, but insertps is limited to 1. On Sandybridge, PIledriver, and Bulldozer, it's 2 vs. 1.
Doing this transform also reduces the number of patterns we have to match when optimizing scalar SSE code.