We can use GF2P8AFFINEQB to reverse bits in a byte. Shuffles are needed to reverse the bytes in elements larger than i8. LegalizeVectorOps takes care of inserting the shuffle for the larger element size.
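As a rough illustration of why this works, here is a portable scalar model of one byte lane of GF2P8AFFINEQB, following the instruction's documented per-bit definition (result bit i is the parity of matrix byte 7-i ANDed with the source byte, XORed with bit i of the immediate). The function names and the kBitReverseMatrix constant name are made up for this sketch and are not taken from the patch. The matrix value 0x8040201008040201 is the one that maps bit i to bit 7-i. The 32-bit helper mimics the extra byte-reversing shuffle needed for wider elements.

```cpp
#include <cstdint>

// Scalar model of one byte lane of GF2P8AFFINEQB (illustrative only).
// Result bit i = parity(matrix.byte[7 - i] & x) ^ imm.bit[i].
inline uint8_t gf2p8affine_byte(uint64_t matrix, uint8_t x, uint8_t imm) {
  uint8_t result = 0;
  for (int i = 0; i < 8; ++i) {
    uint8_t row = static_cast<uint8_t>(matrix >> (8 * (7 - i)));
    uint8_t masked = static_cast<uint8_t>(row & x);
    // Fold the byte down to its parity in bit 0.
    masked ^= static_cast<uint8_t>(masked >> 4);
    masked ^= static_cast<uint8_t>(masked >> 2);
    masked ^= static_cast<uint8_t>(masked >> 1);
    uint8_t bit = static_cast<uint8_t>((masked & 1) ^ ((imm >> i) & 1));
    result = static_cast<uint8_t>(result | (bit << i));
  }
  return result;
}

// Hypothetical name: matrix whose byte j is (1 << j), so output bit i
// selects input bit 7 - i.
const uint64_t kBitReverseMatrix = 0x8040201008040201ULL;

// Bit reversal of a single byte: one affine transform, zero immediate.
inline uint8_t reverse_bits8(uint8_t x) {
  return gf2p8affine_byte(kBitReverseMatrix, x, 0);
}

// For elements wider than i8: reverse the bits in each byte, then reverse
// the byte order (the job the shuffle does in the vector lowering).
inline uint32_t reverse_bits32(uint32_t v) {
  uint32_t out = 0;
  for (int b = 0; b < 4; ++b) {
    uint8_t byte = static_cast<uint8_t>(v >> (8 * b));
    out |= static_cast<uint32_t>(reverse_bits8(byte)) << (8 * (3 - b));
  }
  return out;
}
```

This is only a model of the semantics. The actual lowering operates on whole vectors and builds the matrix as a constant operand.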
We already have Custom lowering for v16i8 with SSSE3, v32i8 with AVX, and v64i8 with AVX512BW. So we only need to mark v16i8 with SSE2+GFNI as Custom.
I think we might be able to use this for scalars too by moving into a vector and back. But I'll save that for a follow-up as it's a little more involved.