Extension to D39729 which performed this for vXi16, with the same bit flipping to handle SMAX/SMIN/UMAX cases, vXi8 UMIN horizontal reductions can be performed.
This makes use of the fact that by performing a pair-wise i8 SHUFFLE_UMIN before PHMINPOSUW, we both get the UMIN of each pair but also zero-extend the upper bits ready for v8i16.
Shouldn't this just be getSizeInBits()? It's already scalar.