Currently, we canonicalize shuffles that produce a result larger than
their operands with:
shuffle(concat(v1, undef), concat(v2, undef))
->
shuffle(concat(v1, v2), undef)
because we can access quad (128-bit Q) registers directly (see
PerformVECTOR_SHUFFLECombine).
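For instance (masks written out by hand for illustration; this isn't
actual -debug output), an interleaving v8i16 shuffle of two v4i16
inputs goes from:
shuffle<0,8,1,9,2,10,3,11>(concat(v1, undef), concat(v2, undef))
->
shuffle<0,4,1,5,2,6,3,7>(concat(v1, v2), undef)
where the indices get remapped because v2's elements now live in
lanes 4-7 of the single concat.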
This is useful in the general case, but there are special cases where
native shuffles produce larger results: the two-result ops (VZIP,
VUZP, VTRN).
Look through the concat when lowering them:
shuffle(concat(v1, v2), undef)
->
concat(VZIP(v1, v2):0, :1)
This lets us generate the native shuffles instead of scalarizing to
dozens of VMOVs.
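Concretely (a hand-written example assuming hard-float AAPCS register
assignment; exact registers may differ), IR like:
define <8 x i16> @zip(<4 x i16> %v1, <4 x i16> %v2) {
  %s = shufflevector <4 x i16> %v1, <4 x i16> %v2,
       <8 x i32> <i32 0, i32 4, i32 1, i32 5, i32 2, i32 6, i32 3, i32 7>
  ret <8 x i16> %s
}
can now select to roughly:
vzip.16 d0, d1
with the two d-register results together forming the returned q0,
rather than an element-by-element VMOV expansion.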
I'm a little worried about the disparity between the lowering and
isShuffleMaskLegal: with the current API, we have no way of looking at
the actual operands. It isn't a problem in practice, though, because
the ARM combine runs last.
The obvious alternative would be to stop doing the combine altogether,
but I think it's useful. We could also avoid doing it for these masks,
but we'd still need to look through concat(v, undef) to avoid
generating needlessly wide shuffles.