This adds the vgetq_lane and vsetq_lane families of intrinsics, which
copy between a scalar and a specified lane of a vector.
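As a rough illustration of what these families do, here is a minimal
sketch that writes a scalar into lane 2 of a four-lane vector and reads
it back. On an MVE target it uses vsetq_lane_s32 and vgetq_lane_s32
from arm_mve.h; elsewhere it falls back to a portable stand-in. The
model_* names, set_lane2 and get_lane2 are hypothetical helpers made up
for this example and are not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__ARM_FEATURE_MVE)
#include <arm_mve.h>

/* MVE path: the lane index must be an integer constant expression. */
static void set_lane2(const int32_t in[4], int32_t value, int32_t out[4]) {
    int32x4_t v = vld1q_s32(in);
    v = vsetq_lane_s32(value, v, 2); /* copy the scalar into lane 2 */
    vst1q_s32(out, v);
}

static int32_t get_lane2(const int32_t in[4]) {
    return vgetq_lane_s32(vld1q_s32(in), 2); /* copy lane 2 out to a scalar */
}

#else

/* Portable stand-in (hypothetical): a plain struct models a 4 x i32 vector. */
typedef struct { int32_t lane[4]; } model_int32x4;

static model_int32x4 model_setq_lane_s32(int32_t a, model_int32x4 b, int idx) {
    assert(idx >= 0 && idx < 4);
    b.lane[idx] = a; /* only the selected lane changes */
    return b;
}

static int32_t model_getq_lane_s32(model_int32x4 a, int idx) {
    assert(idx >= 0 && idx < 4);
    return a.lane[idx];
}

static void set_lane2(const int32_t in[4], int32_t value, int32_t out[4]) {
    model_int32x4 v;
    for (int i = 0; i < 4; ++i) v.lane[i] = in[i];
    v = model_setq_lane_s32(value, v, 2);
    for (int i = 0; i < 4; ++i) out[i] = v.lane[i];
}

static int32_t get_lane2(const int32_t in[4]) {
    model_int32x4 v;
    for (int i = 0; i < 4; ++i) v.lane[i] = in[i];
    return model_getq_lane_s32(v, 2);
}

#endif
```

In both paths only the selected lane is replaced; the other lanes of
the input vector are carried through unchanged.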
One of the new vgetq_lane intrinsics returns a float16_t, which
causes a compile error if %clang_cc1 doesn't get the option
-fallow-half-arguments-and-returns. The driver already passes that
option to cc1, but the tests invoke cc1 directly, so I've had to edit
all the explicit cc1 command lines in the existing MVE intrinsics tests.
This also includes a couple of fixes for the code I wrote up front in
MveEmitter to support lane-index immediates, which nothing had
tested until now: the type was wrong (uint32_t instead of int),
and the range was off by one.
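For reference, the constraint on a lane-index immediate can be sketched
as follows, assuming a 128-bit vector register: a vector of N lanes
accepts indices 0 through N-1 inclusive. This is a sketch of the
intended check only, not the MveEmitter code; lanes_in_q_register and
lane_index_is_valid are hypothetical names made up for this example.

```c
#include <assert.h>
#include <stdbool.h>

/* Number of lanes in a 128-bit vector with the given element width. */
static int lanes_in_q_register(int element_bits) {
    return 128 / element_bits;
}

/* A lane index is valid when it lies in [0, lanes - 1].
 * The index is a signed int, so a negative value is rejected
 * rather than wrapping round to a large unsigned number. */
static bool lane_index_is_valid(int element_bits, int idx) {
    return idx >= 0 && idx < lanes_in_q_register(element_bits);
}
```

With 32-bit elements, for example, index 3 is the last valid lane and
index 4 is out of range.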
I've also added a way to bypass the default promotion to i32 that
the MveEmitter code generation performs. It's sensible to
promote short scalars like i16 to i32 if they're going to be
passed to custom IR intrinsics representing a machine instruction
operating on GPRs, but not if they're going to be passed to standard
IR operations like insertelement, which expect the exact type.