As reported in PR26235, we don't currently use the VBROADCASTF128/VBROADCASTI128 instructions to load and splat a 128-bit vector into both lanes of a 256-bit vector.
This patch enables lowering from subvector insertion/concatenation patterns and auto-upgrades the llvm.x86.avx.vbroadcastf128.pd.256 / llvm.x86.avx.vbroadcastf128.ps.256 intrinsics to match.
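For reference, here is a scalar model in portable C of the load+splat semantics that the concatenation pattern expresses. It is only an illustrative sketch: the type and function names (v4f32, v8f32, broadcast128_ps) are hypothetical and are not the names used in the patch or in the headers.

```c
#include <string.h>

/* Hypothetical models of a 128-bit and a 256-bit float vector. */
typedef struct { float lane[4]; } v4f32;
typedef struct { float lane[8]; } v8f32;

/* Load one 128-bit vector and concatenate it with itself, so the same
 * four floats end up in both 128-bit halves of the 256-bit result.
 * This is the shape of pattern the patch aims to lower to a single
 * VBROADCASTF128 (or VBROADCASTI128 for integer types). */
v8f32 broadcast128_ps(const v4f32 *src) {
    v8f32 out;
    memcpy(&out.lane[0], src->lane, sizeof src->lane); /* low half  */
    memcpy(&out.lane[4], src->lane, sizeof src->lane); /* high half */
    return out;
}
```

The model uses two copies from the same source on purpose: that is the "concatenate a loaded subvector with itself" form, as opposed to a scalar broadcast, which repeats a single element.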
Once this is in place I can update _mm256_broadcast_ps and _mm256_broadcast_pd in the headers to use generic IR and remove the clang builtins.
We could also investigate using VBROADCASTF128/VBROADCASTI128 to load repeated constants (as we already do for scalar broadcasts).