// shuffle (concat X, undef), (concat Y, undef), Mask -->
// concat (shuffle X, Y, Mask0), (shuffle X, Y, Mask1)
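To illustrate how Mask0 and Mask1 could be derived from the wide mask, here is a minimal standalone sketch. It is not the code from this patch: `splitWideMask` is a hypothetical helper, and it assumes the usual convention that -1 marks an undef lane and that mask indices at or above the wide vector length select from the second operand.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical helper for illustration only; not taken from the patch.
// wideMask has NumElts entries that index into the two wide operands:
//   [0, NumElts)           selects from concat(X, undef)
//   [NumElts, 2 * NumElts) selects from concat(Y, undef)
// An entry of -1 means the lane is undef.
std::pair<std::vector<int>, std::vector<int>>
splitWideMask(const std::vector<int> &wideMask) {
  assert(wideMask.size() % 2 == 0 && "expected an even number of lanes");
  const int numElts = static_cast<int>(wideMask.size());
  const int halfNumElts = numElts / 2;

  // Both narrow masks start out fully undef.
  std::vector<int> mask0(halfNumElts, -1), mask1(halfNumElts, -1);

  for (int i = 0; i < numElts; ++i) {
    const int m = wideMask[i];
    if (m < 0)
      continue; // lane was already undef
    assert(m < 2 * numElts && "mask index out of range");

    // The upper half of each wide operand is undef, so any lane that reads
    // from it stays undef in the narrow mask.
    if (m % numElts >= halfNumElts)
      continue;

    // Indices into the second operand shift down because the narrow
    // operands X and Y are half as long as the wide ones.
    const int narrow = (m < numElts) ? m : m - halfNumElts;

    if (i < halfNumElts)
      mask0[i] = narrow;
    else
      mask1[i - halfNumElts] = narrow;
  }
  return {mask0, mask1};
}
```

For two-element X and Y, the wide mask {0, 4, 1, 5} splits into Mask0 = {0, 2} and Mask1 = {1, 3}, so each half-width shuffle reads only from X and Y and the undef upper halves never need to be materialized.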
Someone with more ARM NEON experience can confirm, but I think the changes with 'vtrn' are improvements.
The x86 changes look neutral or better. There's one test with an extra instruction, but that could be reversed for a subtarget with the right attributes.
But by default, I think we want to avoid the 256-bit op when possible (in my motivating benchmark, a handful of ymm ops sprinkled into a sequence of xmm ops are triggering frequency throttling on Haswell, resulting in significantly worse perf).