This is the deferred part of D58181. I think this is the right transform by default for these tests regardless of hasFastVariableShuffle(). I.e., we should only form the wide shuffle op when we know that it is profitable, because a shuffle-mask load + ymm op is probably not faster in general than the narrow alternative.
Unless we're running out of registers, a loop that can hoist the load of the shuffle-control vector would certainly benefit from vpermd / vpermps in some of these cases.
On SnB-family, vextracti/f128 and vinserti/f128 cost the same as YMM vperm2f128 or vpermq (immediate lane-crossing shuffles), and the same as vpermd if we have the control vector already loaded. This is *very* different from AMD bdver* / znver1 / jaguar, where insert/extract are *very* cheap, and we should avoid vperm2f128 whenever possible. (In fact emulating it with insert + extract is probably a win for tune=jaguar / bdver* / znver1. Probably not for znver2, which is expected to have 256-bit ALUs.)
It looks like some of these shuffles would be better done with a vshufps YMM to combine the data we want into one register, then a vpermq YMM, imm to reorder 64-bit chunks. On Haswell/Skylake, that gives us 1c + 3c latency, and only 2 shuffle uops. Much better than 2x vpermps + vinsertf128. (And why are we using FP shuffles on integer vectors? The shuffles probably won't cause bypass delays, but FP blends can, and we use vblendps instead of vpblendd.)
If we're worried about the size of vector constants, something like [0,2,4,6,4,6,6,7] can easily be compressed to 8-bit elements and loaded with vpmovzxbd (or sx for negative elements). That's fine outside a loop. We wouldn't want an extra load+shuffle inside a loop, though. vpmovzx can't micro-fuse with a YMM destination on SnB-family.
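To make the size trade-off concrete, here is a minimal sketch in Python of the compressed-constant idea. It is a scalar model, not codegen: `vpmovzxbd_model` and `vpmovsxbd_model` are hypothetical helpers that only mimic the zero/sign extension, and `signed_indices` is made-up data for the sx case.

```python
import struct

# Shuffle-control indices from the comment above.
indices = [0, 2, 4, 6, 4, 6, 6, 7]

# As a full vpermd control vector: eight 32-bit elements, 32 bytes.
full_constant = struct.pack("<8I", *indices)

# Compressed form: one byte per index, 8 bytes.
packed_constant = struct.pack("<8B", *indices)

def vpmovzxbd_model(data):
    """Hypothetical scalar model: zero-extend 8 bytes to eight 32-bit values."""
    return list(struct.unpack("<8B", data))

def vpmovsxbd_model(data):
    """Hypothetical scalar model: sign-extend 8 bytes to eight 32-bit values."""
    return list(struct.unpack("<8b", data))

# Zero extension recovers the same control vector the 32-byte constant holds.
expanded = vpmovzxbd_model(packed_constant)

# Made-up example with negative elements, which need sign extension instead.
signed_indices = [-1, 0, 2, -1, 4, 6, -1, 7]
packed_signed = struct.pack("<8b", *signed_indices)
expanded_signed = vpmovsxbd_model(packed_signed)
```

The constant shrinks from 32 bytes to 8, at the price of the extra extend step that we wouldn't want inside a loop.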
In some cases I think we can use vpackusdw (and then a lane-crossing fixup with vpermq immediate), but we'd have to prepare both inputs with VPAND to zero the high input bits, creating non-negative inputs for i32 -> u16 saturation. Or vpackuswb for i16 -> u8. This could let us do the equivalent of vshufps for narrower elements. We can create the AND mask on the fly from vpcmpeqd same,same / vpsrld ymm, 16. (And for AVX512 we don't need it because AVX512 has narrowing with truncation as an option instead of saturation.)
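A rough scalar model of that sequence, in Python, assuming the usual semantics of vpackusdw (signed i32 in, unsigned-saturated u16 out, packed within each 128-bit lane). The helper names and the input values are hypothetical, chosen only to show where saturation would otherwise bite.

```python
def to_signed32(x):
    """Reinterpret a 32-bit pattern as a signed integer."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x

def vpackusdw_ymm(a, b):
    """Model: i32 -> u16 pack with unsigned saturation, per 128-bit lane."""
    def sat(x):
        return max(0, min(0xFFFF, to_signed32(x)))
    out = []
    for lane in (0, 4):
        out += [sat(x) for x in a[lane:lane + 4]]
        out += [sat(x) for x in b[lane:lane + 4]]
    return out

def vpermq_words(src, imm8):
    """Model: vpermq imm8 viewed as sixteen 16-bit elements (4 per qword)."""
    out = []
    for i in range(4):
        q = (imm8 >> (2 * i)) & 3
        out += src[4 * q:4 * q + 4]
    return out

# Made-up inputs, as 32-bit patterns.
a = [0x00010002, 0xFFFF8000, 0x12345678, 0x80000001,
     0x0000FFFF, 0x7FFFFFFF, 0x00000000, 0xDEADBEEF]
b = [0x11110000 + i for i in range(8)]

all_ones = [0xFFFFFFFF] * 8            # vpcmpeqd same,same
mask = [x >> 16 for x in all_ones]     # vpsrld ymm, 16 -> 0x0000FFFF

masked_a = [x & m for x, m in zip(a, mask)]   # vpand
masked_b = [x & m for x, m in zip(b, mask)]   # vpand

# Pack, then fix up the lane interleaving with vpermq [0,2,1,3] (imm8 = 0xD8).
truncated = vpermq_words(vpackusdw_ymm(masked_a, masked_b), 0xD8)

# Without the masks, saturation changes the result.
unmasked = vpermq_words(vpackusdw_ymm(a, b), 0xD8)
```

With the masks in place the pack never saturates, so the result is the low 16 bits of every element of `a` followed by those of `b`. Without them, 0xFFFF8000 packs to 0 and 0x00010002 packs to 0xFFFF.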
If we already need to shift, we can use vpsrad to create sign-extended inputs for vpackssdw i32 -> i16 saturation.
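As an illustration of the shift case, here is a sketch that takes the high 16 bits of each 32-bit element, again as a scalar Python model with hypothetical helper names and made-up inputs. After an arithmetic shift right by 16, every value already fits in i16, so the signed saturation in vpackssdw never fires.

```python
def to_signed32(x):
    """Reinterpret a 32-bit pattern as a signed integer."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x

def vpsrad(v, count):
    """Model: arithmetic right shift of each 32-bit element."""
    return [(to_signed32(x) >> count) & 0xFFFFFFFF for x in v]

def vpackssdw_ymm(a, b):
    """Model: i32 -> i16 pack with signed saturation, per 128-bit lane.
    Results are returned as 16-bit patterns."""
    def sat(x):
        return max(-0x8000, min(0x7FFF, to_signed32(x))) & 0xFFFF
    out = []
    for lane in (0, 4):
        out += [sat(x) for x in a[lane:lane + 4]]
        out += [sat(x) for x in b[lane:lane + 4]]
    return out

def vpermq_words(src, imm8):
    """Model: vpermq imm8 viewed as sixteen 16-bit elements (4 per qword)."""
    out = []
    for i in range(4):
        q = (imm8 >> (2 * i)) & 3
        out += src[4 * q:4 * q + 4]
    return out

# Made-up inputs, as 32-bit patterns.
a = [0x00010002, 0xFFFF8000, 0x12345678, 0x80000001,
     0x0000FFFF, 0x7FFFFFFF, 0x00000000, 0xDEADBEEF]
b = [0x7FFF0000, 0x80000000, 0x00010000, 0xFFFFFFFF,
     0x00000001, 0xABCD1234, 0x0000FFFF, 0x12340000]

# Shift, pack, then the same vpermq [0,2,1,3] lane fixup (imm8 = 0xD8).
high_halves = vpermq_words(
    vpackssdw_ymm(vpsrad(a, 16), vpsrad(b, 16)), 0xD8)
```

Here `high_halves` holds the upper 16 bits of every element of `a`, then of `b`, with no separate masking step needed.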
This is really bad on Haswell/Skylake, where each of these shuffles costs a port 5 uop. Instead of 2x 128-bit vshufps, we should be doing 1x 256-bit vshufps and then doing a cross-lane fixup with vpermq if we have AVX2 available.
It's the same shuffle for both lanes, so it's crazy to extract and do it separately unless we're lacking AVX2.
    vshufps imm8    # ymm0 = ymm1[1,3],ymm2[1,3],ymm1[5,7],ymm2[5,7]
    vpermq  imm8    # ymm0 = ymm0[0,2,1,3]
    ret
This should be optimal even on bdver* / jaguar / znver1 where vpermq / vpermpd is 3 uops / 2 cycles (Zen numbers).
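For anyone checking the element order, the two-instruction sequence can be modelled in scalar Python. This is only a sketch of the shuffle semantics: the function names are hypothetical, and the imm8 values (0xDD for the [1,3],[1,3] selection, 0xD8 for [0,2,1,3]) are my encoding of the annotations above.

```python
def vshufps_ymm(a, b, imm8):
    """Model: 256-bit vshufps, same in-lane selection in each 128-bit lane.
    Two elements come from a, then two from b."""
    out = []
    for lane in (0, 4):
        out.append(a[lane + ((imm8 >> 0) & 3)])
        out.append(a[lane + ((imm8 >> 2) & 3)])
        out.append(b[lane + ((imm8 >> 4) & 3)])
        out.append(b[lane + ((imm8 >> 6) & 3)])
    return out

def vpermq_ymm(src, imm8):
    """Model: vpermq imm8 on eight 32-bit elements (2 per 64-bit chunk)."""
    out = []
    for i in range(4):
        q = (imm8 >> (2 * i)) & 3
        out += src[2 * q:2 * q + 2]
    return out

# Symbolic elements so the data movement is visible.
ymm1 = [f"a{i}" for i in range(8)]
ymm2 = [f"b{i}" for i in range(8)]

# ymm0 = ymm1[1,3],ymm2[1,3],ymm1[5,7],ymm2[5,7]
shuffled = vshufps_ymm(ymm1, ymm2, 0xDD)

# ymm0 = ymm0[0,2,1,3] in 64-bit chunks
result = vpermq_ymm(shuffled, 0xD8)
```

`shuffled` comes out as a1 a3 b1 b3 a5 a7 b5 b7, and the vpermq fixup turns that into a1 a3 a5 a7 b1 b3 b5 b7: the odd elements of each source, contiguous, from two shuffle uops.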
Thanks, Peter. 'HasFastVariableShuffle' is a heuristic, so we're never going to get it right all the time, but it sounds like we should leave this particular bit of logic as-is and try to chisel out smaller patterns/transforms for sure wins.
I'll leave this open for a bit in case there are more comments, but I assume we'll abandon it.