Loading a vector of four half-precision FP values sometimes results in
an LD1 of two single-precision FP values followed by a reversal. The
reversal performs an incorrect byte swap because of the conversion
from little endian to big endian.
Instead of fixing the byte swap, it is easier to generate the correct
LD1 of four half-precision FP values directly, which avoids the
subsequent reversal altogether.
I suspect that the problem is actually in these patterns: the pattern on line 5823 (v2i32 -> v4i16) uses REV32, but this one (v2i32 -> v4f16) uses REV64. I would expect these patterns to be the same regardless of whether the lanes are integer or floating-point.
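To make the difference concrete, here is a rough sketch of what the two instructions do to four 16-bit lanes. It is a plain C++ model, not code from the backend: `V4H`, `reverseLanes`, `rev32` and `rev64` are hypothetical names chosen for illustration, and the model only tracks lane order, not the actual register or memory layout.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical model of a 64-bit vector holding four 16-bit lanes.
using V4H = std::array<std::uint16_t, 4>;

// Reverse the order of the 16-bit lanes inside each container that
// spans `lanesPerContainer` lanes.
V4H reverseLanes(const V4H &in, std::size_t lanesPerContainer) {
  V4H out{};
  for (std::size_t i = 0; i < in.size(); ++i) {
    std::size_t base = i / lanesPerContainer * lanesPerContainer;
    std::size_t offset = i % lanesPerContainer;
    out[base + (lanesPerContainer - 1 - offset)] = in[i];
  }
  return out;
}

// Models REV32 on .4h: swap the 16-bit lanes within each 32-bit word.
V4H rev32(const V4H &in) { return reverseLanes(in, 2); }

// Models REV64 on .4h: reverse the 16-bit lanes within the 64-bit doubleword.
V4H rev64(const V4H &in) { return reverseLanes(in, 4); }
```

With lanes {0, 1, 2, 3} as input, `rev32` yields {1, 0, 3, 2} while `rev64` yields {3, 2, 1, 0}, so the integer and floating-point patterns would produce different lane orders for the same bitcast if they really do use different instructions.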