When lowering predicate permute builtins we incorrectly assume only
the typically "active" bits for the specified element type play a
role with all other bits zero'd. This is not the case because all
bits are significant, with the element type specifying how they
are grouped:
b8 - permute using a block size of 1 bit b16 - permute using a block size of 2 bits b32 - permute using a block size of 4 bits b64 - permute using a block size of 8 bits
The affected builtins are svrev, svtrn1, svtrn2, svuzp1, svuzp2,
svzip1 and svzip2.
This patch adds new intrinsics to support these operations and
changes the builtin lowering code to emit them. The b8 case remains
unchanged because for that operation the existing intrinsics work
as required and their support for other predicate types has been
maintained as useful if only as a way to test the correctness of
their matching ISD nodes that code generation relies on.
Out of interest, is there a good reason to handle the nxv16 pattern case differently in the I multiclass args? Written this way at a glance it looks like it is missing.