The perfect shuffle (currently enabled only for big endian) may transform a vector shuffle into multiple merges/inserts, but when the same shuffle mask is shared by multiple shuffles, it is better to load the mask once and use multiple vperm instructions.
An obvious blocker is that the mask is not an operand of the vector_shuffle node in the DAG, so I have to record all masks and count the number of uses of each one.
clang-format: please reformat the code
- for (const SDNode &Node: DAG.allnodes()) {
+ for (const SDNode &Node : DAG.allnodes()) {