The perfect shuffle (currently enabled only for big endian) may lower a vector shuffle into multiple merge/insert instructions. However, when the same shuffle mask is shared by multiple shuffles, it is better to load the mask once and use multiple vperm instructions.
An obvious obstacle is that the mask is not an operand of vector_shuffle in the DAG, so I have to record all masks and count the number of uses of each one.
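To illustrate the bookkeeping described above, here is a minimal standalone sketch. It does not use the LLVM API and is not the code in this patch. `ShuffleMask`, `countMaskUses`, `preferVPerm`, and the threshold of 2 are all hypothetical names and values chosen for illustration.

```cpp
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical stand-in for a shuffle mask: one source lane index per
// output element. The real mask lives on the shuffle node, not as an operand.
using ShuffleMask = std::vector<int>;

// Record every mask seen and count how many shuffles use each distinct one.
std::map<ShuffleMask, std::size_t>
countMaskUses(const std::vector<ShuffleMask> &Shuffles) {
  std::map<ShuffleMask, std::size_t> Uses;
  for (const ShuffleMask &M : Shuffles)
    ++Uses[M];
  return Uses;
}

// Assumed heuristic: prefer one mask load plus vperm when the mask is used
// by at least Threshold shuffles. The default of 2 is an illustrative guess.
bool preferVPerm(const std::map<ShuffleMask, std::size_t> &Uses,
                 const ShuffleMask &M, std::size_t Threshold = 2) {
  auto It = Uses.find(M);
  return It != Uses.end() && It->second >= Threshold;
}
```

In this sketch a mask that appears only once would still go through the perfect shuffle path, while a shared mask would amortize the cost of its single load across several vperm instructions.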
clang-format: please reformat the code