Similar to D57867 - this is a 1-line patch with lots of test diffs.
In most cases with half-vector-width narrowing potential, using an extract + 128-bit vshufps is a win because it replaces a 256-bit shuffle with a 128-bit shuffle.
There's 1 potentially controversial diff pattern for a target with "fast-variable-shuffle".
We are changing:
  vmovaps {{.*#+}} ymm1 = [load 256-bit constant permute mask]
  vpermps %ymm0, %ymm1, %ymm0
to:
  vextractf128 $1, %ymm0, %xmm1
  vshufps {{.*#+}} xmm0 = xmm0[0,2],xmm1[0,2]
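To illustrate why the two sequences are interchangeable when only the low 128 bits of the result are used, here is a rough lane-level model in Python. The helper functions are hypothetical stand-ins for the instructions, not real APIs, and the permute mask values are an assumption (the test checks elide the actual constant); the mask is chosen to match the xmm0[0,2],xmm1[0,2] shuffle.

```python
def vpermps(src, idx):
    """Model of 256-bit vpermps: result lane i takes src[idx[i] & 7]."""
    return [src[i & 7] for i in idx]

def vextractf128(src, imm):
    """Model of vextractf128: return the low (imm=0) or high (imm=1) 4 lanes."""
    return src[4 * imm: 4 * imm + 4]

def vshufps(a, b, sel):
    """Model of 128-bit vshufps: lanes 0-1 come from a, lanes 2-3 from b."""
    return [a[sel[0]], a[sel[1]], b[sel[2]], b[sel[3]]]

# Eight float lanes standing in for ymm0.
ymm0 = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0]

# Before: variable permute with a loaded mask. The upper four mask lanes
# are assumed here; they do not matter because only the low half is used.
mask = [0, 2, 4, 6, 0, 0, 0, 0]
before = vpermps(ymm0, mask)[:4]

# After: extract the high half, then shuffle the two 128-bit halves.
xmm1 = vextractf128(ymm0, 1)
xmm0 = ymm0[:4]
after = vshufps(xmm0, xmm1, [0, 2, 0, 2])

print(before)
print(after)
```

Both paths select the even-numbered lanes of the source, so the low half of the result is the same; the second form just never needs the 256-bit mask constant.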
That could be a regression if the permute mask load could be moved out of a loop and the 256-bit op is executed at the same speed/power as a 128-bit op. But I think the extract+shufps combo is the right default choice at this level because it removes a ymm instruction. We should form 256-bit vpermps from the extract+shufps as a later optimization within a loop if that would be profitable.