Pre-AVX512 (which can perform a quick extend/shift/truncate), extending to 2 v8i16 for the PMULLW and then truncating is more performant than relying on the generic PBLENDVB vXi8 shift path and uses a similar amount of mask constant pool data.
Details
Details
Diff Detail
Diff Detail
- Repository
- rL LLVM
Event Timeline
Comment Actions
Some cases AVX2 gives smaller code with the PBLENDVB generic shift pattern - but I didn't spend that much time looking at it - I'll take another look.
test/CodeGen/X86/vector-shift-shl-256.ll | ||
---|---|---|
1027 ↗ | (On Diff #154200) | I think we're best off doing this just for v16i8, enabling v32i8 increases AVX2 to: ; AVX2-LABEL: constant_shift_v32i8: ; AVX2: # %bb.0: ; AVX2-NEXT: vextracti128 $1, %ymm0, %xmm1 ; AVX2-NEXT: vpmovsxbw %xmm1, %ymm1 ; AVX2-NEXT: vmovdqa {{.*#+}} ymm2 = [1,2,4,8,16,32,64,65408,65408,64,32,16,8,4,2,1] ; AVX2-NEXT: vpmullw %ymm2, %ymm1, %ymm1 ; AVX2-NEXT: vextracti128 $1, %ymm1, %xmm3 ; AVX2-NEXT: vmovdqa {{.*#+}} xmm4 = <0,2,4,6,8,10,12,14,u,u,u,u,u,u,u,u> ; AVX2-NEXT: vpshufb %xmm4, %xmm3, %xmm3 ; AVX2-NEXT: vpshufb %xmm4, %xmm1, %xmm1 ; AVX2-NEXT: vpunpcklqdq {{.*#+}} xmm1 = xmm1[0],xmm3[0] ; AVX2-NEXT: vpmovsxbw %xmm0, %ymm0 ; AVX2-NEXT: vpmullw %ymm2, %ymm0, %ymm0 ; AVX2-NEXT: vextracti128 $1, %ymm0, %xmm2 ; AVX2-NEXT: vpshufb %xmm4, %xmm2, %xmm2 ; AVX2-NEXT: vpshufb %xmm4, %xmm0, %xmm0 ; AVX2-NEXT: vpunpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm2[0] ; AVX2-NEXT: vinserti128 $1, %xmm1, %ymm0, %ymm0 ; AVX2-NEXT: retq |