Try to widen the element type to get a new mask value for a better permutation
sequence, so that we can use NEON shuffle instructions such as ZIP1/2,
UZP1/2, TRN1/2, REV, INS, etc.
For example:
`shufflevector <4 x i32> %a, <4 x i32> %b, <4 x i32> <i32 6, i32 7, i32 2, i32 3>`
is equivalent to:
`shufflevector <2 x i64> %a, <2 x i64> %b, <2 x i32> <i32 3, i32 1>`
Finally, we can get:
`mov v0.d[0], v1.d[1]`
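The mask rewrite in the example above can be sketched in isolation. This is only an illustration under stated assumptions: `widenMask` is a hypothetical helper, not the function under review or any LLVM API, and it rejects undef (-1) lanes instead of trying to match them.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical helper (not LLVM's API): try to rewrite a shuffle mask over
// narrow elements as a mask over elements `scale` times wider.
// Returns std::nullopt when some group of `scale` mask entries does not
// select one whole, aligned wide element.
std::optional<std::vector<int>> widenMask(const std::vector<int> &mask,
                                          int scale) {
  assert(scale > 0 && "scale must be positive");
  const std::size_t step = static_cast<std::size_t>(scale);
  assert(mask.size() % step == 0 && "mask must split evenly into groups");

  std::vector<int> wide;
  for (std::size_t i = 0; i < mask.size(); i += step) {
    const int first = mask[i];
    // Simplification: undef (-1) lanes are rejected rather than matched.
    // The group must also start on a wide-element boundary.
    if (first < 0 || first % scale != 0)
      return std::nullopt;
    // The remaining entries of the group must be consecutive.
    for (std::size_t j = 1; j < step; ++j)
      if (mask[i + j] != first + static_cast<int>(j))
        return std::nullopt;
    wide.push_back(first / scale);
  }
  return wide;
}
```

With the mask `<6, 7, 2, 3>` and a scale of 2, the groups `(6, 7)` and `(2, 3)` each cover one aligned 64-bit element, so the sketch produces `<3, 1>`. A mask such as `<1, 2, 3, 4>` is rejected because its groups straddle wide-element boundaries.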
Perhaps add a comment explaining the function.