
[X86][AVX512] Use PACKSS/PACKUS for vXi16->vXi8 truncations without BWI.
Needs Review · Public

Authored by RKSimon on Nov 18 2017, 2:12 PM.

Details

Summary

We currently don't use the PACKSS truncation combine for AVX512 targets, forcing targets without BWI to use the EXT+TRUNC pattern.

Diff Detail

Repository
rL LLVM

Event Timeline

RKSimon created this revision. Nov 18 2017, 2:12 PM
delena added inline comments. Nov 18 2017, 10:42 PM
test/CodeGen/X86/vector-compare-results.ll
320

You are inserting AVX2 instructions instead of AVX-512, right? If yes, the previous code is better, since we have more registers in AVX-512.

This patch came about as a possible tweak prior to adding AVX512BW PACKUS/PACKSS support for PR34871, which was concerned about the port 5 load from doubling the number of truncations. But yes, register pressure is likely to be an issue. I still need to finish PR34773 first, though, so I'll come back to this in a while.

KNL has slow vpacksswb/vpackuswb ymm. It doesn't look optimal for most of these cases: vpmovzxwd / vpmovdb looks better. For AVX512F without BW+DQ, we should probably go ahead and use it, i.e. assume that AVX2 instructions are fast.

Also note that KNL has faster vpmovzx than vpmovsx.

test/CodeGen/X86/avx512-ext.ll
1725

AVX2 ymm byte / word shuffles are SLOW on KNL, even though the xmm version is fast (except for pshufb). This sequence would make sense for AVX512F tune=generic (because it's very good on SKX and presumably future mainstream CPUs with good AVX2 support), but definitely *not* for -march=knl.

vpacksswb xmm is fast: 1 uop / 1c throughput / 2-6c latency, but the YMM versions of vpack / vpunpck (except for DQ and QDQ) are 5 uops / 9c throughput.

In this case: vpacksswb ymm + vpermq ymm = 5 + 1 = 6 shuffle uops, and maybe 10c throughput (9 + 1 assuming they all compete for the same execution resources and can't pipeline with each other).

2x vpmovzxwd y->z + 2x vpmovdb z->x + vinserti128 x->y = 5 shuffle uops, throughput = 2x2 + 2x1 + 1 = 7 cycles (with no decode stalls from multi-uop instructions). The extra ILP probably doesn't help at all because it appears there's only one shuffle execution unit (on FP0). So it's not *much* better, but avoiding the decode bottleneck should allow much better out-of-order execution and probably hyperthreading friendliness.
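The uop/cycle arithmetic above can be written out as a quick back-of-the-envelope tally. This is an illustrative model using exactly the reciprocal throughputs quoted in the comment (originally from Agner Fog's KNL numbers), not a pipeline simulation:

```python
# Shuffle-uop / throughput tally for the two vXi16->vXi8 lowerings on
# KNL, using the per-instruction numbers quoted above (cycles are
# reciprocal throughputs, assuming all uops compete for one shuffle unit).

# Option A: vpacksswb ymm (5 uops, 9c) + vpermq ymm (1 uop, 1c)
pack_uops = 5 + 1
pack_cycles = 9 + 1

# Option B: 2x vpmovzxwd y->z (1 uop, 2c each)
#         + 2x vpmovdb z->x (1 uop, 1c each)
#         + vinserti128 x->y (1 uop, 1c)
ext_uops = 2 * 1 + 2 * 1 + 1
ext_cycles = 2 * 2 + 2 * 1 + 1

print(pack_uops, pack_cycles)  # -> 6 10
print(ext_uops, ext_cycles)    # -> 5 7
```

So the extend+truncate chain is only modestly ahead on raw shuffle throughput; the bigger difference is avoiding the 5-uop decode of vpacksswb ymm.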

1728

vpmovsx (all forms) is 2 uops on KNL, vs. 1 for vpmovzx (all element / register sizes). This is a big deal for the front-end (2c throughput vs. 7-8c throughput). If you're about to feed it to a truncate and only doing it to work around lack of AVX512BW, definitely use ZX.

If only one vector was needed, vpmovzx %ymm,%zmm / vpmovdb %zmm, %xmm looks like a big win according to Agner Fog's uarch guide + instruction tables.
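The reason ZX is always safe in this pattern: when the widened lanes are immediately truncated again, the extended bits are discarded, so sign- and zero-extension produce identical results. A minimal Python sketch of that round-trip identity (modeling the v16i16 -> v16i32 -> v16i8 lane transformation on plain integers):

```python
# trunc8(ext32(x)) is the same whether ext32 is sign- or zero-extension,
# because only the low bits of the original 16-bit lane survive.

def sext16_to_32(x):
    """Sign-extend a 16-bit lane to a 32-bit unsigned bit pattern."""
    return (x | 0xFFFF0000) if (x & 0x8000) else x

def zext16_to_32(x):
    """Zero-extend a 16-bit lane to 32 bits."""
    return x

for x in range(0, 0x10000):
    # Both extensions preserve the low 16 bits...
    assert sext16_to_32(x) & 0xFFFF == zext16_to_32(x) & 0xFFFF == x
    # ...so truncating to 8 bits (vpmovdb) gives the same byte either way.
    assert sext16_to_32(x) & 0xFF == zext16_to_32(x) & 0xFF
```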

test/CodeGen/X86/avx512-trunc.ll
568

This is a win: two 1-uop shuffles with 1c throughput (vextracti128 + vpackuswb xmm) are definitely better than vpmovzx (1 uop / 2c throughput) + vpmovdb (1 uop / 1c throughput).

And vpmovSX is 2 uops, 7-8c throughput (decode bottleneck), so the original was horrible because of the missed vpmovzx optimization, but the vpackuswb version is better still because it only uses XMM registers.
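Summing the per-instruction throughputs quoted above gives a rough ranking of the three 128-bit truncation sequences (again a sketch based on the quoted KNL numbers, not measured data):

```python
# Reciprocal-throughput tally (cycles) for the three vXi16->vXi8
# truncation sequences discussed above, on KNL.
pack_c = 1 + 1     # vextracti128 (1c) + vpackuswb xmm (1c)
zx_c = 2 + 1       # vpmovzx (2c, not fully pipelined) + vpmovdb (1c)
sx_c = 7.5 + 1     # vpmovsx (2 uops, ~7-8c decode bottleneck) + vpmovdb (1c)

# pack beats zx+truncate, which beats the original sx+truncate
assert pack_c < zx_c < sx_c
```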

test/CodeGen/X86/vector-compare-results.ll
320

In most cases that's hopefully minor compared to the shuffle throughput gain from using vextracti128 / vpackss xmm (both 1c throughput on KNL).

vpmovsx is 2 uops on KNL, so using it instead of vpmovzx is a big missed optimization, but even vpmovzx is 1 uop / 2c throughput (not fully pipelined). See my previous comment.