For BITREVERSE, shifting and masking every individual bit of a vector element is a very lengthy procedure.
If the input vector's element type is a whole number of bytes wide, we can split this into a BSWAP shuffle stage (to reverse at the byte level) followed by a BITREVERSE stage applied to each byte. Most vector-capable targets can BSWAP efficiently using shuffles, resulting in a considerable reduction in instructions.
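As a rough illustration of the split, here is a scalar model of a single 32-bit element. It is only a sketch of the idea, not the code in the patch: the helper names (bswap32, bitreverse8, bitreverse32) are made up for this example, and the real lowering works on vector nodes, with the byte swap done as a shuffle.

```cpp
#include <cstdint>

// Stage 2 building block: reverse the 8 bits of one byte by swapping
// nibbles, then bit pairs, then adjacent bits.
constexpr std::uint8_t bitreverse8(std::uint8_t b) {
  b = static_cast<std::uint8_t>(((b & 0xF0u) >> 4) | ((b & 0x0Fu) << 4));
  b = static_cast<std::uint8_t>(((b & 0xCCu) >> 2) | ((b & 0x33u) << 2));
  b = static_cast<std::uint8_t>(((b & 0xAAu) >> 1) | ((b & 0x55u) << 1));
  return b;
}

// Stage 1: reverse the byte order (scalar stand-in for the BSWAP shuffle).
constexpr std::uint32_t bswap32(std::uint32_t x) {
  return (x >> 24) | ((x >> 8) & 0x0000FF00u) |
         ((x << 8) & 0x00FF0000u) | (x << 24);
}

// Full reversal: byte swap first, then reverse the bits within each byte.
constexpr std::uint32_t bitreverse32(std::uint32_t x) {
  std::uint32_t swapped = bswap32(x);
  std::uint32_t result = 0;
  for (int i = 0; i < 4; ++i) {
    std::uint8_t byte = static_cast<std::uint8_t>(swapped >> (8 * i));
    result |= static_cast<std::uint32_t>(bitreverse8(byte)) << (8 * i);
  }
  return result;
}

// Compile-time sanity checks of the two-stage decomposition.
static_assert(bitreverse32(0x00000001u) == 0x80000000u, "lowest bit moves to the top");
static_assert(bitreverse32(0x12345678u) == 0x1E6A2C48u, "mixed bit pattern");
```

The two stages are independent, so the per-byte stage can be replaced by whatever a target does best for vXi8 without touching the byte swap.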
With this patch, targets only need to provide a target-specific vXi8 BITREVERSE implementation to efficiently reverse most legal vector types (I intend to add an SSSE3 PSHUFB implementation as an example).
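For context on what a per-byte implementation might look like, one common technique is a 16-entry nibble lookup, which is the kind of table a byte shuffle such as PSHUFB can apply to every lane at once. The scalar model below is an assumption about the general shape only, not the planned implementation; kRevNibble and bitreverse8_lut are hypothetical names.

```cpp
#include <cstdint>

// Hypothetical lookup table: kRevNibble[n] is the 4-bit reversal of n.
constexpr std::uint8_t kRevNibble[16] = {
    0x0, 0x8, 0x4, 0xC, 0x2, 0xA, 0x6, 0xE,
    0x1, 0x9, 0x5, 0xD, 0x3, 0xB, 0x7, 0xF};

// Reverse one byte: the reversed low nibble becomes the high nibble and
// the reversed high nibble becomes the low nibble.
constexpr std::uint8_t bitreverse8_lut(std::uint8_t b) {
  return static_cast<std::uint8_t>((kRevNibble[b & 0x0Fu] << 4) |
                                   kRevNibble[b >> 4]);
}

// Compile-time sanity checks.
static_assert(bitreverse8_lut(0x01u) == 0x80u, "lowest bit moves to the top");
static_assert(bitreverse8_lut(0x78u) == 0x1Eu, "mixed bit pattern");
```

In a vector version, the two table lookups would each be a single shuffle over all bytes, with the masks and shift applied lane-wise.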
James - This has definitely been tested on AArch64 this time. At the moment I've just added a token rev32 CHECK, but I can expand this to a more thorough set of CHECKs if you'd prefer. I know the x86 approach of scripted 'full' CHECKs isn't always popular ;-)