The VINSERTF128 instruction is often much quicker, and never slower, than the more general VPERM2F128 instruction, so we should try to use that in more circumstances.
This requires a fallback to a commuted VPERM2F128 for the case where we need to fold the 256-bit vector source instead of the 128-bit subvector source.
There is one interesting side effect - DAGCombine's narrowExtractedVectorLoad combine gets called in a number of locations, this always creates an extracted subvector load without regard to other uses of the original wider load. I'm expecting AVX cpus to be capable of merging such aliased loads, but I do wonder whether narrowExtractedVectorLoad's call to X86TargetLowering::shouldReduceLoadWidth needs to be altered to check for more partial uses?
Noticed while investigating the quality of interleaved load/store codegen.
clang-format: please reformat the code