and (pcmpgt X, -1), Y --> pandn (sra X, BitWidth-1), Y
This avoids materializing the -1 constant vector by using an arithmetic shift instruction instead, when one exists for the element type (the ISA is still not complete after all these years...).
We catch this pattern late in combining by matching PCMPGT, so it should not interfere with more general folds.
The test diffs are against trunk. This should eliminate lumps in D113426 (if we want to see the cumulative results, I can rebase on top of that).
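For anyone who wants to sanity-check the fold outside the compiler, here is a rough per-lane model for 16-bit elements. It is only a sketch: the helper names (`pcmpgt16`, `psra16`, `pandn16`, `fold_holds_for_all_i16`) are made up for illustration and are not part of the patch, and the lanes are modeled with scalar integers rather than real vector instructions. The idea is that `pcmpgt X, -1` is all-ones exactly when X is non-negative, while `sra X, BitWidth-1` is all-ones exactly when X is negative, so the inversion built into PANDN yields the same mask.

```cpp
#include <cstdint>

// Hypothetical per-lane model of PCMPGTW: all-ones if a > b (signed), else zero.
static uint16_t pcmpgt16(int16_t a, int16_t b) {
  return a > b ? 0xFFFF : 0x0000;
}

// Hypothetical per-lane model of PSRAW, written with unsigned arithmetic so it
// does not rely on how the host compiler shifts negative signed values.
// Assumes 0 <= n <= 15.
static uint16_t psra16(int16_t x, unsigned n) {
  uint16_t u = static_cast<uint16_t>(x);
  uint16_t logical = static_cast<uint16_t>(u >> n);
  // Fill the vacated high bits with ones when the input is negative.
  uint16_t sign_fill = (x < 0) ? static_cast<uint16_t>(~(0xFFFFu >> n)) : 0;
  return static_cast<uint16_t>(logical | sign_fill);
}

// Hypothetical per-lane model of PANDN: (~a) & b.
static uint16_t pandn16(uint16_t a, uint16_t b) {
  return static_cast<uint16_t>(~a & b);
}

// Checks and(pcmpgt(X, -1), Y) == pandn(sra(X, 15), Y) for every i16 X
// and a handful of Y values.
bool fold_holds_for_all_i16() {
  const uint16_t ys[] = {0x0000, 0xFFFF, 0x1234, 0x8001};
  for (int32_t xi = INT16_MIN; xi <= INT16_MAX; ++xi) {
    int16_t x = static_cast<int16_t>(xi);
    for (uint16_t y : ys) {
      uint16_t before = static_cast<uint16_t>(pcmpgt16(x, -1) & y);
      uint16_t after = pandn16(psra16(x, 15), y);
      if (before != after)
        return false;
    }
  }
  return true;
}
```

Since the AND is bitwise, the few Y values stand in for any Y: each bit of the result depends only on the matching bit of the mask and of Y.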
What does the BW mean? Byte and word? But the tests cover i16 and i32.