If we're doing a v4f32 shuffle on x86 with SSE4.1, we can lower certain
shufflevectors to an insertps instruction. This applies when most of the
shufflevector result's elements come from one vector (and keep their
index), and one element comes from another vector or a memory operand.
Added tests for insertps optimizations on shufflevector.
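
To make the matched pattern concrete, here is a minimal standalone sketch of
the mask check described above. It is not the code from this patch and does
not use LLVM's APIs: the names InsertMatch, matchSingleInsert and insertpsImm
are hypothetical, the mask is assumed to follow the shufflevector convention
for two v4f32 operands (0..3 select from the first operand, 4..7 from the
second), and undef lanes and the memory-operand case are left out.

```cpp
#include <array>
#include <cassert>

// Hypothetical result of a successful match; not an LLVM type.
struct InsertMatch {
    int baseVector;  // 0 or 1: operand that supplies the three in-place lanes
    int dstLane;     // lane of the result that is overwritten
    int srcLane;     // lane of the other operand that is inserted
};

// Returns true if exactly three lanes come from one operand at their
// original index and the remaining lane comes from the other operand.
bool matchSingleInsert(const std::array<int, 4> &mask, InsertMatch &out) {
    for (int base = 0; base < 2; ++base) {
        int inPlace = 0;
        int dstLane = -1;
        for (int i = 0; i < 4; ++i) {
            if (mask[i] < 0 || mask[i] > 7)
                return false;  // undef or invalid lanes: not handled here
            if (mask[i] == base * 4 + i)
                ++inPlace;
            else
                dstLane = i;
        }
        if (inPlace != 3)
            continue;
        // The odd lane must come from the *other* operand.
        if (mask[dstLane] / 4 == base)
            continue;
        out.baseVector = base;
        out.dstLane = dstLane;
        out.srcLane = mask[dstLane] % 4;
        return true;
    }
    return false;
}

// Builds the insertps immediate for a register source: bits 7:6 select the
// source lane, bits 5:4 the destination lane, bits 3:0 (zero mask) stay 0.
unsigned insertpsImm(const InsertMatch &m) {
    assert(m.srcLane >= 0 && m.srcLane < 4 && m.dstLane >= 0 && m.dstLane < 4);
    return (static_cast<unsigned>(m.srcLane) << 6) |
           (static_cast<unsigned>(m.dstLane) << 4);
}
```

For example, the mask <0, 1, 6, 3> keeps lanes 0, 1 and 3 of the first
operand and takes lane 2 of the second operand into lane 2, while the
identity mask <0, 1, 2, 3> is rejected because no lane needs inserting.
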
I would change this into: