As reported on PR51796, the _mm256_loadu2_m128i in particular was inserting bitcasts and shuffles with different types making it trickier for some combines, and prevented the value tracker from identifying the shuffle sequences as a single insert_subvector style concat_vectors pattern.
This patch instead concatenate the 128-bit unaligned loads with _mm256_set_m128*, which was written to avoid the unnecessary bitcasts and only emits a single shuffle.
clang-tidy: error: "Never use <avxintrin.h> directly; include <immintrin.h> instead." [clang-diagnostic-error]
not useful