We currently don't use the PACKSS truncation combine for AVX512 targets, forcing targets without BWI to use the EXT+TRUNC pattern.
This patch came about as a possible tweak prior to adding AVX512BW PACKUS/PACKSS support for PR34871, which was concerned about the port5 load from double the number of truncations. But yes, register limits are likely to be an issue. I still need to finish PR34773 first, though, so I'll come back to this in a while.
KNL has slow vpacksswb ymm. It doesn't look optimal for most of these cases: vpmovzxwd / vpmovdb looks better. For AVX512F without BW+DQ, we should probably go ahead and use the pack sequence, i.e. assume that AVX2 instructions are fast.
Also note that KNL has faster vpmovzx than vpmovsx.
AVX2 ymm byte / word shuffles are SLOW on KNL, even though the xmm version is fast (except for pshufb). This sequence would make sense for AVX512F tune=generic (because it's very good on SKX and presumably future mainstream CPUs with good AVX2 support), but definitely *not* for -march=knl.
vpacksswb xmm is fast: 1 uop / 1c throughput / 2-6c latency, but the YMM versions of vpack / vpunpck (except for DQ and QDQ) are 5 uops / 9c throughput.
In this case: vpacksswb ymm + vpermq ymm = 5 + 1 = 6 shuffle uops, and maybe 10c throughput (9 + 1 assuming they all compete for the same execution resources and can't pipeline with each other).
2x vpmovzxwd y->z + 2x vpmovdb z->x + vinserti128 x->y = 5 shuffle uops, throughput = 2x2 + 2x1 + 1 = 7 cycles (with no decode stalls from multi-uop instructions). The extra ILP probably doesn't help at all because it appears there's only one shuffle execution unit (on FP0). So it's not *much* better, but avoiding the decode bottleneck should allow much better out-of-order execution and probably hyperthreading friendliness.
vpmovsx (all forms) is 2 uops on KNL, vs. 1 for vpmovzx (all element / register sizes). This is a big deal for the front-end (2c throughput vs. 7-8c throughput). If you're about to feed it to a truncate and are only doing it to work around the lack of AVX512BW, definitely use ZX.
If only one vector was needed, vpmovzx %ymm,%zmm / vpmovdb %zmm, %xmm looks like a big win according to Agner Fog's uarch guide + instruction tables.
This is a win: two 1-uop shuffles with 1c throughput (vextracti128 / vpackuswb xmm) are definitely better than vpmovzx (1 uop / 2c throughput) / vpmovdb (1 uop / 1c throughput).
And vpmovSX is 2 uops, 7-8c throughput (decode bottleneck), so the original was horrible because of the missed vpmovzx optimization, but the vpackuswb version is still better than that because it's only using XMM registers.
In most cases that's hopefully minor compared to the shuffle throughput gain from using vextracti128 / vpackss xmm (both 1c throughput on KNL).
vpmovsx is 2 uops on KNL, so it's a big missed optimization to use it instead of vpmovzx, but even vpmovzx is 1 uop / 2c throughput (not fully pipelined). See my previous comment.