Try to avoid creating VGBMs by reusing the permutation mask when it contains a zero byte. If the first byte of the result selects (any byte of) a zero vector, then the first byte of the mask can be set to zero, and the mask can be reused as the zero source by also passing it as the first operand. If instead the first byte of the other source operand is used, the mask already contains a zero index, and that zero byte can be reused if the mask is placed as the second operand. The first case will probably be important with the upcoming patch for zero_extend_vector_inreg, while the second case affects just a few cases on SPEC'17. The patch could of course be simplified if the second case was skipped...
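To make the two cases concrete, here is a small standalone model (not the patch itself, and not using any LLVM API). It assumes the usual VPERM semantics, where each mask byte indexes into the 32-byte concatenation of the first and second operands. The names `vperm`, `Plan`, `planMaskReuse` and `applyPlan` are hypothetical and only serve the illustration. The sketch also assumes that any result byte selecting source byte 0 yields a reusable zero in the mask, which may be more general than what the patch actually checks. It needs C++17 for `std::optional`.

```cpp
#include <array>
#include <cstdint>
#include <optional>

using Vec = std::array<uint8_t, 16>;

// Simplified model of VPERM: mask byte i selects byte (Mask[i] & 31) of the
// 32-byte concatenation A:B.
Vec vperm(const Vec &A, const Vec &B, const Vec &Mask) {
  Vec R{};
  for (int I = 0; I < 16; ++I) {
    unsigned Idx = Mask[I] & 31u;
    R[I] = Idx < 16 ? A[Idx] : B[Idx - 16];
  }
  return R;
}

// Hypothetical result of the analysis: the mask bytes and which source
// operand slot the mask itself occupies.
struct Plan {
  Vec Mask;
  bool MaskIsFirstOperand;
};

// Want[i] == -1 means "result byte i is zero"; otherwise Want[i] is the
// index (0..15) of the byte taken from the single non-zero source.
std::optional<Plan> planMaskReuse(const std::array<int, 16> &Want) {
  Plan P{};
  if (Want[0] < 0) {
    // Case 1: result byte 0 is zero, so Mask[0] can be 0. With the mask as
    // first operand, index 0 reads Mask[0], which is zero.
    P.MaskIsFirstOperand = true;
    for (int I = 0; I < 16; ++I)
      P.Mask[I] = static_cast<uint8_t>(Want[I] < 0 ? 0 : 16 + Want[I]);
    return P;
  }
  // Case 2: some result byte Z selects source byte 0, so Mask[Z] == 0. With
  // the mask as second operand, index 16 + Z reads that zero byte.
  for (int Z = 0; Z < 16; ++Z) {
    if (Want[Z] != 0)
      continue;
    P.MaskIsFirstOperand = false;
    for (int I = 0; I < 16; ++I)
      P.Mask[I] = static_cast<uint8_t>(Want[I] < 0 ? 16 + Z : Want[I]);
    return P;
  }
  return std::nullopt; // No zero byte in the mask to reuse.
}

// Emit the modelled VPERM with the mask doubling as the zero source.
Vec applyPlan(const Plan &P, const Vec &Src) {
  return P.MaskIsFirstOperand ? vperm(P.Mask, Src, P.Mask)
                              : vperm(Src, P.Mask, P.Mask);
}
```

In this model, a zero-extend-like pattern (zero, src[0], zero, src[1], ...) falls into the first case, and a pattern that starts with src[0] and needs zeros elsewhere falls into the second. When `planMaskReuse` returns `std::nullopt`, a separate zero vector would still be needed.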
There is also the case of Intrinsic::s390_vperm, but it seems more complicated to wait until DAGCombine to handle this, since then we have to extract the mask back from a constant pool or maybe even a SystemZVectorConstant... Besides, the user should be able to do that on their own as well, right?
This doesn't seem to have that big of an effect on SPEC'17:
vperm :  21136  21110  -26
vgbm  :  11385  11368  -17
vl    : 109414 109411   -3
larl  : 371345 371347   +2
...
, but it probably becomes more noticeable with the zero_extend_vector_inreg patch... Should it be committed separately first, or rather incorporated into that other patch?
It seems that MachineCSE can now remove a few VPERMs as well. At least one case was because the mask was used instead of an undef source operand (it doesn't help to replace an undef Ops[1] with Op2 on the last line of getGeneralPermuteNode(), as I first thought it would). The test case I have reduced (not included) shows that MCSE on trunk fails to remove the second vperm even though the instructions are nearly identical. The only difference from the successful case with this patch is that instead of the reused mask, there is an IMPLICIT_DEF. It looks to me like this is a minor deficiency of MCSE (two different vregs defined by IMPLICIT_DEF should not stop CSE, I would think).
isZeroOrUndefVector(): The check for the undef vector used to be beneficial for the zero_extend_vector_inreg patch, but the way it looks now (just using a single unpack), it is no longer needed for that (NFC). I think it could be removed, or we can keep it here to get a few fewer VPERMs...
Two tests, one for each case handled. Not quite sure what happened with vector-constrained-fp-intrinsics.ll: I don't understand why v0 is no longer used, but it seems that this patch begins to change things when the second operand of the VPERM is undefined.