AArch64 was only setting costs for SK_Transpose, which meant that many of the simpler shuffles (e.g. SK_Select and SK_PermuteSingleSrc for larger vector elements) were being severely overestimated by the default shuffle expansion.
This patch adds costs to help improve SLP performance and avoid a regression in reductions in an upcoming patch that will allow SLP to recognise a lot more SK_Select shuffle patterns.
I'm not very knowledgeable about AArch64 shuffle lowering, so I've kept the extra costs to a minimum - someone who knows this code can add further costs, which should improve vectorization a lot more.
I'm not sure whether the costs for 64-bit and 128-bit vectors should differ.
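
For reviewers less familiar with how these costs are consulted, here is a minimal standalone sketch of the idea: a per-target table keyed on shuffle kind and vector type, with a fallback to a generic per-element expansion when no entry matches. All names here (ShuffleKind, VecTy, CostEntry, shuffleCost, genericShuffleCost) are hypothetical stand-ins rather than the real LLVM types, and the cost values are illustrative only, not the ones in this patch. The fallback is modelled as one extract plus one insert per element, which is an assumption made for the sketch.

```cpp
// Hypothetical stand-ins for the shuffle kinds and vector types; not LLVM's.
enum class ShuffleKind { Transpose, Select, PermuteSingleSrc, Reverse };
enum class VecTy { v2i32, v4i32, v2i64 };  // v2i32 is 64 bits, the others 128

struct CostEntry {
  ShuffleKind Kind;
  VecTy Ty;
  unsigned Cost;
};

// Illustrative values only: every listed shuffle is assumed to cost one
// instruction, for both 64-bit and 128-bit vectors.
static const CostEntry ShuffleTbl[] = {
    {ShuffleKind::Transpose, VecTy::v2i32, 1},
    {ShuffleKind::Transpose, VecTy::v4i32, 1},
    {ShuffleKind::Transpose, VecTy::v2i64, 1},
    {ShuffleKind::Select, VecTy::v2i32, 1},
    {ShuffleKind::Select, VecTy::v4i32, 1},
    {ShuffleKind::Select, VecTy::v2i64, 1},
    {ShuffleKind::PermuteSingleSrc, VecTy::v2i32, 1},
    {ShuffleKind::PermuteSingleSrc, VecTy::v4i32, 1},
    {ShuffleKind::PermuteSingleSrc, VecTy::v2i64, 1},
};

// Number of elements in each hypothetical vector type.
unsigned numElts(VecTy Ty) {
  switch (Ty) {
  case VecTy::v2i32:
  case VecTy::v2i64:
    return 2;
  case VecTy::v4i32:
    return 4;
  }
  return 0;
}

// Generic fallback, modelled as one extract plus one insert per element.
unsigned genericShuffleCost(VecTy Ty) { return 2 * numElts(Ty); }

// Use the table entry when one exists, otherwise the generic expansion.
unsigned shuffleCost(ShuffleKind Kind, VecTy Ty) {
  for (const CostEntry &E : ShuffleTbl)
    if (E.Kind == Kind && E.Ty == Ty)
      return E.Cost;
  return genericShuffleCost(Ty);
}
```

Under these assumptions, a Select on v4i32 costs 1 once it has a table entry, whereas a kind with no entry (Reverse in this sketch) falls through to the generic cost of 8, which is the sort of overestimate described above.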