This is a preparatory patch for generating more VLLEZ instructions, although it also has some benefits on its own.
lowerExtendVectorInreg() no longer always unpacks ZERO_EXTEND_VECTOR_INREG, but instead expands it to a shuffle with a zero vector.
GeneralShuffle::getNode() now detects and handles cases that involve a zero vector. An advantage of this is that it is now possible to tune whether a vperm or unpacking should be used, depending on how many unpacks are needed.
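To illustrate the expansion, here is a minimal standalone sketch of how such a shuffle mask could be built. It is not the code from the patch: the function name zeroExtendInRegMask is hypothetical, plain std::vector stands in for the SelectionDAG types, and a big-endian element layout is assumed. Mask indices below NumInElts select from the source vector, indices from NumInElts upwards select from the zero vector.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: build a shuffle mask that zero-extends the first
// NumOutElts elements of a vector that has NumInElts elements.
// Indices 0..NumInElts-1 pick from the source vector, indices >= NumInElts
// pick from the zero vector. Big-endian layout is assumed, so the source
// element ends up in the last (least significant) slot of each wide element.
std::vector<int> zeroExtendInRegMask(unsigned NumInElts, unsigned NumOutElts) {
  assert(NumOutElts != 0 && NumInElts % NumOutElts == 0 &&
         "element counts must divide evenly");
  const unsigned NumInPerOut = NumInElts / NumOutElts;
  std::vector<int> Mask(NumInElts);
  unsigned ZeroVecElt = NumInElts; // first index that refers to the zero vector
  for (unsigned OutElt = 0; OutElt < NumOutElts; ++OutElt) {
    unsigned MaskElt = OutElt * NumInPerOut;
    const unsigned Last = MaskElt + NumInPerOut - 1;
    // Leading slots of the wide element come from the zero vector.
    for (; MaskElt < Last; ++MaskElt)
      Mask[MaskElt] = static_cast<int>(ZeroVecElt++);
    // The final slot holds the original narrow element.
    Mask[Last] = static_cast<int>(OutElt);
  }
  return Mask;
}
```

For a v16i8 source extended to v4i32, this gives a mask where every fourth slot holds source element 0, 1, 2, 3 and the remaining slots refer to the zero vector.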
It seemed to work better to use the new UnpackInfo struct to work on the byte level directly, rather than trying to detect byte sequences and from that deduce the unpacks needed.
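The byte-level idea can be sketched roughly as follows. This is a simplified illustration and not the UnpackInfo implementation from the patch: the names ByteMask, ZeroByte, countUnpacks and preferUnpacks are hypothetical, only the plain zero-extension pattern of the leading source elements is recognized (no shuffled sources), and each unpack is assumed to double the element size. The MaxUnpackOps parameter stands in for a tunable threshold such as the -max-unpackops option.

```cpp
#include <array>
#include <cassert>

constexpr int UndefByte = -1; // byte value does not matter
constexpr int ZeroByte = -2;  // byte must be zero (hypothetical encoding)

// One entry per result byte: a source byte index 0..15, ZeroByte or UndefByte.
using ByteMask = std::array<int, 16>;

// Check whether Bytes is a big-endian zero extension of the leading source
// elements from SrcSize bytes to DstSize bytes each.
bool isZeroExtend(const ByteMask &Bytes, unsigned SrcSize, unsigned DstSize) {
  const unsigned Pad = DstSize - SrcSize; // zero bytes in front of each element
  for (unsigned I = 0; I < Bytes.size(); ++I) {
    if (Bytes[I] == UndefByte)
      continue; // undefined bytes match anything
    const unsigned Elt = I / DstSize;
    const unsigned Pos = I % DstSize;
    const int Expected =
        Pos < Pad ? ZeroByte : static_cast<int>(Elt * SrcSize + (Pos - Pad));
    if (Bytes[I] != Expected)
      return false;
  }
  return true;
}

// Return the number of unpack steps needed (each step doubles the element
// size), or -1 if the mask is not a recognized zero extension.
int countUnpacks(const ByteMask &Bytes) {
  for (unsigned SrcSize = 1; SrcSize <= 4; SrcSize *= 2) {
    int Steps = 1;
    for (unsigned DstSize = SrcSize * 2; DstSize <= 8; DstSize *= 2, ++Steps)
      if (isZeroExtend(Bytes, SrcSize, DstSize))
        return Steps;
  }
  return -1;
}

// Use unpacks only if the mask can be done with at most MaxUnpackOps of them;
// otherwise the caller would fall back to a permute.
bool preferUnpacks(const ByteMask &Bytes, int MaxUnpackOps) {
  const int Steps = countUnpacks(Bytes);
  return Steps >= 0 && Steps <= MaxUnpackOps;
}
```

With this sketch, an i8 -> i32 extension counts as two unpacks, so it would be done with unpacks at a threshold of 2 and with a permute at a threshold of 1.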
It is also necessary to handle the cases of a shuffled input source, which must be done cleverly to avoid generating worse code.
An alternative to this - for the purpose of VLLEZ - might be to only expand those potential candidates to shuffles. Expanding all of them generally reduces the number of vperms since more cases can now be handled with unpacks, but there is also a need to find the zero vector and make sure to handle it last in getNode().
When I realized that the number of permutes had increased at one point, I attempted an algorithm to find different orders of shuffling in getNode(). This was however not a simple thing to do, since the DAG nodes could then not be created while evaluating an order of the shuffles. I then found out that putting the new zero vector last handled this problem, and the number of vperms is now the same or lower on all files with this.
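The "zero vector last" ordering can be pictured with a small standalone sketch. The Operand struct and moveZeroVectorLast() are hypothetical stand-ins for the shuffle operands collected in getNode(), and the real code may find and order the zero vector differently. The sketch only shows that the relative order of the other operands is kept while the zero vector moves to the end.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for one shuffle operand.
struct Operand {
  int Id;            // identifies the operand in this sketch
  bool IsZeroVector; // true for the all-zeros vector
};

// Move any zero-vector operand to the end while keeping the relative order
// of all other operands unchanged.
void moveZeroVectorLast(std::vector<Operand> &Ops) {
  std::stable_partition(Ops.begin(), Ops.end(),
                        [](const Operand &Op) { return !Op.IsZeroVector; });
}
```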
On Imagick (this time without ffp-contract=fast), this gave a good improvement *without* VLLEZ. Using the other vllez-patch directly (https://reviews.llvm.org/D76275) gave a slightly bigger improvement. Using this patch and also generating VLLEZs gave the exact same improvement as the first patch. So looking at just this benchmark, I see that this patch is beneficial on its own, but if generating VLLEZ it doesn't really play a role.
Apart from Imagick, I see a 2.5% improvement on x264 if using -max-unpackops=2, which is interesting.
Other notes:
lowerExtendVectorInreg(): All extended bytes are defined to 0, even if the original element was undefined (-1). Is this needed, or could it be assumed that a zero-extended undefined element has all bytes undefined as well? Not sure if it matters.
unpacking with a permute: There are a few (140) cases with only two defined elements, where the VPERM of Best.Mask can be done with a VREPI. Most of those cases however required 3 unpacks after the VPERM, and just a dozen could be done with 2 unpacks after. I removed that check since it also did not seem to improve any benchmarks.
New tests: fun0: This particular set of values (i8 compares -> i32) seems not to work on trunk while others do.
Use -debug-only=systemz-lower to see what is going on.
Now that this routine shares no code at all between the sign- and zero-extend case, I think it would make more sense to just have two different routines lowerSIGN_EXTEND_VECTOR_INREG and lowerZERO_EXTEND_VECTOR_INREG.