This is similar logic/motivation to the select splitting in D62969.
In D63233, the pattern changes so that we no longer have an extract_subvector of vselect, but the operands of the select are still being concatenated.
The closest case is represented by either the first or last test diff here: we have an extra instruction, but we converted 3-4 ymm instructions into 4-5 xmm instructions. I think that's the right trade-off for most AVX1 targets.
In the example based on PR37428:
https://bugs.llvm.org/show_bug.cgi?id=37428
this makes the loop about 30% faster (tested on Haswell by compiling with -mavx).