We currently don't use the PACKSS truncation combine for AVX512 targets, forcing targets without BWI to use the EXT+TRUNC pattern.
Repository: rL LLVM
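To illustrate the summary above, here is a minimal intrinsics sketch (not code from the patch; the helper names are invented) of the two lowerings for a v16i16 -> v16i8 truncate on AVX512F without BWI, where no direct vpmovwb exists. The PACKSS form assumes the combine has proven the input words are sign-extended bytes, so saturation never changes the result. Compiles with -mavx512f:

```c
#include <immintrin.h>

/* EXT+TRUNC pattern: widen words to dwords, then use the AVX512F
   dword->byte truncate. */
static __m128i trunc_v16i16_ext(__m256i v) {
    __m512i wide = _mm512_cvtepu16_epi32(v); /* vpmovzxwd ymm->zmm */
    return _mm512_cvtepi32_epi8(wide);       /* vpmovdb   zmm->xmm */
}

/* PACKSS pattern: split into xmm halves and pack with signed
   saturation -- a no-op when the words are already in byte range. */
static __m128i trunc_v16i16_pack(__m256i v) {
    __m128i lo = _mm256_castsi256_si128(v);
    __m128i hi = _mm256_extracti128_si256(v, 1); /* vextracti128 */
    return _mm_packs_epi16(lo, hi);              /* vpacksswb xmm */
}
```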
test/CodeGen/X86/vector-compare-results.ll:320

You are inserting AVX2 instructions instead of AVX-512, right? If yes, the previous code is better, since we have more registers in AVX-512.
This patch came about as a possible tweak prior to adding AVX512BW PACKUS/PACKSS support for PR34871, which was concerned about the port 5 load from doubling the number of truncations. But yes, register pressure is likely to be an issue. I still need to finish PR34773 first, though, so I will come back to this in a while.
KNL has slow ymm vpack*wb (vpacksswb/vpackuswb). It doesn't look optimal for most of these cases: vpmovzxwd / vpmovdb looks better. For AVX512F without BW+DQ, we should probably go ahead and use the PACK pattern anyway, i.e. assume that AVX2 instructions are fast.
Also note that KNL has faster vpmovzx than vpmovsx.
test/CodeGen/X86/avx512-ext.ll:1725

AVX2 ymm byte / word shuffles are SLOW on KNL, even though the xmm version is fast (except for pshufb). This sequence would make sense for AVX512F tune=generic (because it's very good on SKX and presumably future mainstream CPUs with good AVX2 support), but definitely *not* for -march=knl.

vpacksswb xmm is fast: 1 uop / 1c throughput / 2-6c latency, but the ymm versions of vpack / vpunpck (except for DQ and QDQ) are 5 uops / 9c throughput. In this case: vpacksswb ymm + vpermq ymm = 5 + 1 = 6 shuffle uops, and maybe 10c throughput (9 + 1, assuming they all compete for the same execution resources and can't pipeline with each other).

2x vpmovzxwd y->z + 2x vpmovdb z->x + vinserti128 x->y = 5 shuffle uops, throughput = 2x2 + 2x1 + 1 = 7 cycles (with no decode stalls from multi-uop instructions). The extra ILP probably doesn't help at all because it appears there's only one shuffle execution unit (on FP0). So it's not *much* better, but avoiding the decode bottleneck should allow much better out-of-order execution and probably hyperthreading friendliness.
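To make that comparison concrete, a hypothetical intrinsics rendering of the two sequences, taking the two source word vectors as ymm halves (helper names invented; the pack form assumes the values are in byte range so saturation is a no-op):

```c
#include <immintrin.h>

/* (a) AVX2-style pack: vpacksswb ymm interleaves the 128-bit lanes,
   so a vpermq is needed afterwards. KNL: 5 + 1 = 6 shuffle uops. */
static __m256i trunc_v32i16_pack(__m256i lo, __m256i hi) {
    __m256i p = _mm256_packs_epi16(lo, hi);                      /* vpacksswb ymm */
    return _mm256_permute4x64_epi64(p, _MM_SHUFFLE(3, 1, 2, 0)); /* vpermq */
}

/* (b) AVX512F-only: zero-extend each half w->d, truncate d->b,
   then reassemble. KNL: 5 single-uop shuffles, ~7c throughput. */
static __m256i trunc_v32i16_pmov(__m256i lo, __m256i hi) {
    __m128i lo8 = _mm512_cvtepi32_epi8(_mm512_cvtepu16_epi32(lo)); /* vpmovzxwd + vpmovdb */
    __m128i hi8 = _mm512_cvtepi32_epi8(_mm512_cvtepu16_epi32(hi));
    return _mm256_inserti128_si256(_mm256_castsi128_si256(lo8), hi8, 1); /* vinserti128 */
}
```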
test/CodeGen/X86/avx512-ext.ll:1728

vpmovsx (all forms) is 2 uops on KNL, vs. 1 for vpmovzx (all element / register sizes). This is a big deal for the front-end (2c throughput vs. 7-8c throughput). If you're about to feed it to a truncate and are only doing it to work around the lack of AVX512BW, definitely use ZX.

If only one vector is needed, vpmovzx %ymm,%zmm / vpmovdb %zmm,%xmm looks like a big win according to Agner Fog's uarch guide + instruction tables.
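The zx-over-sx point holds because the extension's high bits are discarded again by the truncate, so either extend yields the same bytes. A small sketch of the two equivalent-but-differently-costed forms (hypothetical helpers):

```c
#include <immintrin.h>

/* vpmovdb keeps only the low 8 bits of each dword, so zero- and
   sign-extension produce identical results here. On KNL vpmovzxwd
   is 1 uop while vpmovsxwd is 2, so the zx form is the one to emit. */
static __m128i trunc_v16i16_zx(__m256i v) {
    return _mm512_cvtepi32_epi8(_mm512_cvtepu16_epi32(v)); /* vpmovzxwd + vpmovdb */
}
static __m128i trunc_v16i16_sx(__m256i v) { /* same result, slower decode on KNL */
    return _mm512_cvtepi32_epi8(_mm512_cvtepi16_epi32(v)); /* vpmovsxwd + vpmovdb */
}
```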
test/CodeGen/X86/avx512-trunc.ll:568

This is a win: two 1-uop shuffles with 1c throughput (vextracti128 / vpackuswb xmm) is definitely better than vpmovzx (1 uop / 2c throughput) / vpmovdb (1 uop / 1c throughput). And vpmovSX is 2 uops / 7-8c throughput (decode bottleneck), so the original was horrible because of the missed vpmovzx optimization, but the vpackuswb version is still better than that because it's only using xmm registers.
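The winning sequence, sketched with intrinsics as the unsigned sibling of the earlier PACKSS example (hypothetical helper; assumes each word is already a zero-extended byte, so unsigned saturation never fires):

```c
#include <immintrin.h>

/* Two 1-uop, 1c-throughput shuffles on KNL, staying in xmm registers. */
static __m128i trunc_v16i16_packus(__m256i v) {
    __m128i lo = _mm256_castsi256_si128(v);
    __m128i hi = _mm256_extracti128_si256(v, 1); /* vextracti128 */
    return _mm_packus_epi16(lo, hi);             /* vpackuswb xmm */
}
```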
test/CodeGen/X86/vector-compare-results.ll:320

In most cases that's hopefully minor compared to the shuffle throughput gain from using vextracti128 / vpackss xmm (both 1c throughput on KNL). vpmovsx is 2 uops on KNL, so it's a big missed optimization to use it instead of vpmovzx, but even vpmovzx is 1 uop / 2c throughput (not fully pipelined). See my previous comment.
Abandoning old patch - the performant AVX512 use cases for truncateVectorWithPACK have already been added over the years.