This patch removes the old 'perfect shuffle' optimization, since it causes a performance penalty in hot loops over vectors.
For example, consider a loop in which several shuffles share the same mask:
%v.1 = shufflevector %x.1, %y.1, <0, 1, 2, 3, 8, 9, 10, 11, 16, 17, 18, 19, 24, 25, 26, 27>
%v.2 = shufflevector %x.2, %y.2, <0, 1, 2, 3, 8, 9, 10, 11, 16, 17, 18, 19, 24, 25, 26, 27>
...
The generated instructions would be:
vmrglw ...
vmrghw ...
vmrglw ...
vmrghw ...
instead of
vperm vperm ...
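To make the shared mask in the example above concrete, here is a minimal Python sketch of what it selects. It models only the element-numbering semantics of shufflevector (indices below the vector length pick from the first operand, the rest from the second). It assumes the operands are 16-element byte vectors, which the example does not state, and it says nothing about register layout or endianness. The helper names `shufflevector` and `mask_as_words` are hypothetical and exist only for illustration.

```python
# The mask shared by the shuffles in the example.
MASK = [0, 1, 2, 3, 8, 9, 10, 11, 16, 17, 18, 19, 24, 25, 26, 27]


def shufflevector(x, y, mask):
    """Model shufflevector: index i picks from the concatenation of x and y."""
    combined = list(x) + list(y)
    return [combined[i] for i in mask]


def mask_as_words(mask, vec_bytes=16, word_bytes=4):
    """Describe a byte mask as (source, word index) pairs.

    Assumes 16-byte operands and 4-byte words; raises if the mask does not
    select whole, aligned words.
    """
    words = []
    for start in range(0, len(mask), word_bytes):
        chunk = mask[start:start + word_bytes]
        first = chunk[0]
        aligned = first % word_bytes == 0
        contiguous = chunk == list(range(first, first + word_bytes))
        if not (aligned and contiguous):
            raise ValueError("mask is not word-granular")
        source = "x" if first < vec_bytes else "y"
        words.append((source, (first % vec_bytes) // word_bytes))
    return words


# Distinct byte values make it easy to see where each result byte came from.
x = list(range(0, 16))
y = list(range(100, 116))
result = shufflevector(x, y, MASK)
selection = mask_as_words(MASK)
```

Under these assumptions the mask picks words 0 and 2 of the first operand followed by words 0 and 2 of the second. Because every shuffle in the loop uses the same mask, a single permute control vector could in principle be reused across all of them.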
In some cases with large loops, this causes a performance degradation of more than 20%. The same situation also shows up in perfect-shuffle.ll.
We do see that codegen gets worse in some cases when perfect shuffle is disabled; those cases will be fixed more carefully in future patches.