When lowering the get.active.lane.mask intrinsic with a fixed-width
predicate vector result, we can use the SVE whilelo instruction when
SVE is enabled. We do this by choosing a suitable scalable VT for the
whilelo instruction, then promoting the result to an integer vector,
e.g. nxv16i1 -> nxv16i8. We can then extract a v16i8 subvector and
truncate it back to the original return type, e.g. v16i1.
This leads to a significant improvement in code quality. Tests such as
lane_mask_v8i1_i32 also show that choosing the right scalable VT for
the whilelo instruction removes the "xtn v0.8b, v0.8h" instruction.
This is because for NEON v8i1 gets promoted to v8i8, rather than
v8i16, and so the natural element type to choose is i8.
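
As a rough illustration only, the steps above can be modelled on plain
arrays in C++. This is a scalar sketch, not the actual lowering code:
the names whileloModel, promoteToI8 and loweredLaneMask are
hypothetical, the scalable predicate is assumed to have 16 lanes
(vscale == 1), and a true lane is assumed to promote to the value 1.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Assumption: model nxv16i1 at the minimum SVE vector length (vscale == 1).
constexpr std::size_t kWhileLanes = 16;

// Reference semantics of get.active.lane.mask: lane i is active iff
// base + i < n, compared as unsigned and computed without wraparound.
template <std::size_t N>
std::array<bool, N> activeLaneMaskRef(std::uint32_t base, std::uint32_t n) {
  std::array<bool, N> mask{};
  for (std::size_t i = 0; i < N; ++i)
    mask[i] = static_cast<std::uint64_t>(base) + i < n;
  return mask;
}

// Step 1 (hypothetical model of whilelo): set lanes while the running
// index is below n, and stop at the first inactive lane.
std::array<bool, kWhileLanes> whileloModel(std::uint32_t base,
                                           std::uint32_t n) {
  std::array<bool, kWhileLanes> pred{};
  std::uint64_t idx = base;
  for (std::size_t i = 0; i < kWhileLanes && idx < n; ++i, ++idx)
    pred[i] = true;
  return pred;
}

// Step 2: promote the predicate to an integer vector (nxv16i1 -> nxv16i8).
// Assumption: an active lane becomes 1; only the low bit matters later.
std::array<std::uint8_t, kWhileLanes>
promoteToI8(const std::array<bool, kWhileLanes> &pred) {
  std::array<std::uint8_t, kWhileLanes> bytes{};
  for (std::size_t i = 0; i < kWhileLanes; ++i)
    bytes[i] = pred[i] ? 1 : 0;
  return bytes;
}

// Steps 3 and 4: extract the leading N-lane fixed-width subvector and
// truncate each i8 lane back to i1 by keeping its low bit.
template <std::size_t N>
std::array<bool, N> loweredLaneMask(std::uint32_t base, std::uint32_t n) {
  static_assert(N <= kWhileLanes, "fixed-width result must fit the model");
  const auto bytes = promoteToI8(whileloModel(base, n));
  std::array<bool, N> mask{};
  for (std::size_t i = 0; i < N; ++i)
    mask[i] = (bytes[i] & 1) != 0;
  return mask;
}
```

Under these assumptions, loweredLaneMask<8>(base, n) should agree with
activeLaneMaskRef<8>(base, n) for any inputs, which mirrors the v8i1
case exercised by lane_mask_v8i1_i32.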