v4i32 shuffles that insert a single element into a zero vector lower to X86vzmovl, which was using (v)blendps. Because (v)blendps is a floating-point-domain instruction operating on integer data, this caused domain-switch stalls. This patch fixes that by using (v)pblendw instead.
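To illustrate why the substitution preserves behaviour, here is a minimal scalar sketch in C. It is not the lowering code from this patch. The type and function names are hypothetical, and the immediates (0x1 for blendps, 0x03 for pblendw) are my assumption of the masks that select the low 32-bit element. It models both blends and checks that they produce the same "low element, upper lanes zero" result.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar model of a 128-bit register as four 32-bit lanes. */
typedef struct { uint32_t lane[4]; } v4i32;

/* Model of BLENDPS: bit i of imm takes 32-bit lane i from src, else dst. */
static v4i32 model_blendps(v4i32 dst, v4i32 src, unsigned imm) {
    v4i32 r;
    for (int i = 0; i < 4; ++i)
        r.lane[i] = ((imm >> i) & 1u) ? src.lane[i] : dst.lane[i];
    return r;
}

/* Model of PBLENDW: bit i of imm takes 16-bit word i from src, else dst. */
static v4i32 model_pblendw(v4i32 dst, v4i32 src, unsigned imm) {
    uint16_t d[8], s[8];
    v4i32 r;
    memcpy(d, dst.lane, sizeof d);
    memcpy(s, src.lane, sizeof s);
    for (int i = 0; i < 8; ++i)
        if ((imm >> i) & 1u)
            d[i] = s[i];
    memcpy(r.lane, d, sizeof d);
    return r;
}

/* Zero-extending move of the low element, expressed with each blend.
   Assumed immediates: 0x1 selects lane 0; 0x03 selects words 0 and 1. */
static v4i32 vzmovl_via_blendps(v4i32 x) {
    v4i32 zero = {{0, 0, 0, 0}};
    return model_blendps(zero, x, 0x1);
}

static v4i32 vzmovl_via_pblendw(v4i32 x) {
    v4i32 zero = {{0, 0, 0, 0}};
    return model_pblendw(zero, x, 0x03);
}

/* Both forms must agree bit for bit. */
static void self_check(void) {
    v4i32 x = {{0x12345678u, 0xAAAAAAAAu, 0xBBBBBBBBu, 0xCCCCCCCCu}};
    v4i32 a = vzmovl_via_blendps(x);
    v4i32 b = vzmovl_via_pblendw(x);
    assert(memcmp(&a, &b, sizeof a) == 0);
}
```

The results are identical, so the only difference on real hardware is the execution domain the instruction belongs to.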
The updated tests in test/CodeGen/X86/sse41.ll still contain a domain stall due to the use of insertps; I plan to fix this in a future patch.
Pre-SSE4.1 targets are still affected by a similar domain stall from movss. We could fix this by using two ( punpckldq XMM, zero ) instructions in series; if people agree, I'll make a patch for this as well.
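As a sanity check on the punpckldq idea, the following scalar sketch in C models the proposed two-step sequence. It is an illustration only, not proposed patch code, and the type and function names are hypothetical. It shows that interleaving with zero twice leaves only the low element.

```c
#include <stdint.h>

/* Hypothetical scalar model of a 128-bit register as four 32-bit lanes. */
typedef struct { uint32_t lane[4]; } lanes4;

/* Model of PUNPCKLDQ a, b: interleave the low two dwords of a and b,
   giving { a0, b0, a1, b1 }. */
static lanes4 model_punpckldq(lanes4 a, lanes4 b) {
    lanes4 r = {{ a.lane[0], b.lane[0], a.lane[1], b.lane[1] }};
    return r;
}

/* Two unpacks against zero in series:
   step 1: { x0, 0, x1, 0 }
   step 2: { x0, 0, 0,  0 } */
static lanes4 vzmovl_via_punpckldq(lanes4 x) {
    lanes4 zero = {{0, 0, 0, 0}};
    lanes4 t = model_punpckldq(x, zero);
    return model_punpckldq(t, zero);
}
```

This only confirms that the sequence computes the right value. Whether two integer-domain unpacks are actually cheaper than movss plus the domain crossing would need measuring on the affected targets.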