I think we can end up with packss+movmsk sequences either because the code was written that way with intrinsics (see the sketch after the diff) or because we have a likely over-enthusiastic DAG transform that is trying to prevent something like this with AVX1:
```
@@ -377,8 +386,8 @@ define i64 @test_v4i64_legal_sext(<4 x i64> %a0, <4 x i64> %a1) {
 ; AVX1-NEXT:    vextractf128 $1, %ymm0, %xmm3
 ; AVX1-NEXT:    vpcmpgtq %xmm2, %xmm3, %xmm2
 ; AVX1-NEXT:    vpcmpgtq %xmm1, %xmm0, %xmm0
-; AVX1-NEXT:    vpackssdw %xmm2, %xmm0, %xmm0
-; AVX1-NEXT:    vmovmskps %xmm0, %eax
+; AVX1-NEXT:    vinsertf128 $1, %xmm2, %ymm0, %ymm0
+; AVX1-NEXT:    vmovmskps %ymm0, %eax
```
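As an example of the first case, here's a rough sketch (not from this patch or its tests; names are illustrative) of IR written directly against the x86 pack/movmsk intrinsics, which carries the packss+movmsk shape straight through lowering:

```
declare <8 x i16> @llvm.x86.sse2.packssdw.128(<4 x i32>, <4 x i32>)
declare <16 x i8> @llvm.x86.sse2.packsswb.128(<8 x i16>, <8 x i16>)
declare i32 @llvm.x86.sse2.pmovmskb.128(<16 x i8>)

; Compare two v4i32 halves, pack the -1/0 lane masks down to bytes,
; then collect the sign bits; the low 8 bits of %m are the compare mask.
define i32 @packss_movmsk(<4 x i32> %a0, <4 x i32> %b0, <4 x i32> %a1, <4 x i32> %b1) {
  %c0 = icmp sgt <4 x i32> %a0, %b0
  %c1 = icmp sgt <4 x i32> %a1, %b1
  %s0 = sext <4 x i1> %c0 to <4 x i32>
  %s1 = sext <4 x i1> %c1 to <4 x i32>
  %w  = call <8 x i16> @llvm.x86.sse2.packssdw.128(<4 x i32> %s0, <4 x i32> %s1)
  %b  = call <16 x i8> @llvm.x86.sse2.packsswb.128(<8 x i16> %w, <8 x i16> %w)
  %m  = call i32 @llvm.x86.sse2.pmovmskb.128(<16 x i8> %b)
  ret i32 %m
}
```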
That's also why *this* patch is limited for AVX1. I'm not sure yet what it would take to get that right in all cases.
There's potentially a better way to solve more of these patterns generally: always sink extends after shuffles, so we're shuffling bool vectors early in SDAG. That's almost certainly needed in IR to unlock some missed vector optimization, and we could repeat it here in the DAG (possibly with a hook), but I don't think it obviates the need for this patch.
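To make that reordering concrete, a minimal IR sketch (function names are illustrative): today we can end up shuffling the wide extended vector, and the idea is to canonicalize toward shuffling the narrow bool vector first:

```
; Before: shuffle of the extended (wide) vector.
define <4 x i64> @shuffle_of_sext(<4 x i1> %b) {
  %x = sext <4 x i1> %b to <4 x i64>
  %s = shufflevector <4 x i64> %x, <4 x i64> undef, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
  ret <4 x i64> %s
}

; After: shuffle the narrow bool vector, then extend.
define <4 x i64> @sext_of_shuffle(<4 x i1> %b) {
  %s = shufflevector <4 x i1> %b, <4 x i1> undef, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
  %x = sext <4 x i1> %s to <4 x i64>
  ret <4 x i64> %x
}
```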
Haven't checked it thoroughly, but how come this isn't vmovmskpd ymm0? For both AVX1 and AVX2.