Given that INSERT_VECTOR_ELT operates on D registers anyway, combining 64-bit vectors into a 128-bit vector is basically free. Therefore, try to split BUILD_VECTOR nodes before giving up and lowering them to a series of INSERT_VECTOR_ELT instructions. Sometimes this allows dramatically better lowerings; see testcases for examples. Inspired by similar code in the x86 backend for AVX.
For the @vcombine_vdup, I'm not happy with the DAGCombine transforms which produce a BUILD_VECTOR in the first place; we're taking splat shuffles which were carefully preserved in the IR, and destroying them in DAGCombine by transforming concat_vec(splat(a), splat(a)) -> concat_vec(build_vector(a,a,a,a), build_vector(b,b,b,b)) -> build_vector(a,a,a,a,b,b,b,b). Maybe we could improve this somehow?
I'm assuming LowerBUILD_VECTOR can return SDNode() if there's no need for it, thus the concat only happening if both were lowered.
Is this is your strategy around "we might discover a better way to lower it"?
How can we avoid cases where it doesn't?