As reported on PR26235, we don't currently make use of the VBROADCASTF128/VBROADCASTI128 instructions to load+splat a 128-bit vector to both lanes of a 256-bit vector.
This patch enables lowering from subvector insertion/concatenation patterns and auto-upgrades the llvm.x86.avx.vbroadcastf128.pd.256 / llvm.x86.avx.vbroadcastf128.ps.256 intrinsics to match.
Once this is in place I can update _mm256_broadcast_ps and _mm256_broadcast_pd in the headers to use generic IR and remove the clang builtins.
We could possibly investigate using VBROADCASTF128/VBROADCASTI128 to load repeated constants as well (similar to how we already do for scalar broadcasts).
The selected instruction here is from AVX2 set. The same patterns should be added to AVX-512.