This is a follow-on to D8833 (insertps optimization when the zero mask is not used).
In this patch, we check for the case where the zero mask is used, but both input vectors to the insertps intrinsic are the same operand. Because only one source vector is involved, the second shuffle input operand is free, so we can replace it with the zero vector and have the zeroed lanes select from it.
I confirmed that the x86 backend generates the expected insertps instructions for the shuffles created here.
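
For reviewers who want to sanity-check the mapping, here is a minimal standalone sketch of the idea. It is not the code in this patch. It assumes the documented insertps immediate layout (bits 7:6 source lane, bits 5:4 destination lane, bits 3:0 zero mask) and models a shuffle whose second input is the zero vector. All names (`insertps`, `sameOperandMask`, `shuffleWithZero`, `equivalentForAllImmediates`) are hypothetical helpers for illustration only.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Vec4 = std::array<float, 4>;
using Mask4 = std::array<unsigned, 4>;

// Reference model of INSERTPS: imm[7:6] selects the source lane,
// imm[5:4] the destination lane, and imm[3:0] is the zero mask.
Vec4 insertps(const Vec4 &dst, const Vec4 &src, uint8_t imm) {
  unsigned srcLane = (imm >> 6) & 0x3;
  unsigned dstLane = (imm >> 4) & 0x3;
  unsigned zmask = imm & 0xF;
  Vec4 r = dst;
  r[dstLane] = src[srcLane];
  for (unsigned i = 0; i < 4; ++i)
    if ((zmask >> i) & 1)
      r[i] = 0.0f; // the zero mask is applied after the insert
  return r;
}

// Hypothetical helper: shuffle mask for the case where both insertps
// inputs are the same vector x. Indices 0-3 pick lanes of x, indices
// 4-7 pick lanes of the zero vector used as the second shuffle input.
Mask4 sameOperandMask(uint8_t imm) {
  unsigned srcLane = (imm >> 6) & 0x3;
  unsigned dstLane = (imm >> 4) & 0x3;
  unsigned zmask = imm & 0xF;
  Mask4 mask = {0, 1, 2, 3}; // start from the identity shuffle of x
  mask[dstLane] = srcLane;   // the insert moves one lane within x
  for (unsigned i = 0; i < 4; ++i)
    if ((zmask >> i) & 1)
      mask[i] = i + 4;       // zeroed lanes read from the zero vector
  return mask;
}

// Model of a two-input shuffle whose second input is all zeros.
Vec4 shuffleWithZero(const Vec4 &x, const Mask4 &mask) {
  const Vec4 zero = {0.0f, 0.0f, 0.0f, 0.0f};
  Vec4 r{};
  for (unsigned i = 0; i < 4; ++i) {
    assert(mask[i] < 8 && "shuffle index out of range");
    r[i] = mask[i] < 4 ? x[mask[i]] : zero[mask[i] - 4];
  }
  return r;
}

// Compare the two models for every possible 8-bit immediate.
bool equivalentForAllImmediates(const Vec4 &x) {
  for (unsigned imm = 0; imm < 256; ++imm) {
    uint8_t i8 = static_cast<uint8_t>(imm);
    if (insertps(x, x, i8) != shuffleWithZero(x, sameOperandMask(i8)))
      return false;
  }
  return true;
}
```

As an example under these assumptions, the immediate 0x4A (source lane 1, destination lane 0, zero mask 0b1010) gives the shuffle mask {1, 5, 2, 7}. The loop also covers immediates with an empty zero mask, where the sketch degenerates to a shuffle that never reads the zero vector.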