Currently, even with SSE2 enabled, we're generating a series of movd and punpcklwd instructions to insert elements into a v8i16. The testcase in the patch is transformed to:
  movd      %r8d, %xmm0
  movd      24(%rsp), %xmm1
  punpcklwd %xmm1, %xmm0
  movd      %edx, %xmm1
  movd      8(%rsp), %xmm2
  punpcklwd %xmm2, %xmm1
  punpcklwd %xmm0, %xmm1
  movd      %ecx, %xmm0
  movd      16(%rsp), %xmm2
  punpcklwd %xmm2, %xmm0
  movd      %r9d, %xmm2
  movd      %esi, %xmm3
  punpcklwd %xmm2, %xmm3
  punpcklwd %xmm0, %xmm3
  punpcklwd %xmm1, %xmm3
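For context, the pattern being lowered is a v8i16 BUILD_VECTOR assembled from eight scalar i16 values. A hypothetical source-level equivalent (illustrative only, not the actual testcase from the patch) could look like:

  // Build a v8i16 element-by-element from eight scalar arguments.
  typedef short v8i16 __attribute__((vector_size(16)));

  v8i16 build(short a, short b, short c, short d,
              short e, short f, short g, short h) {
    v8i16 v = { a, b, c, d, e, f, g, h };
    return v;
  }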
The movd/punpcklwd sequence above could instead be replaced by a series of pinsrw instructions, saving 7 instructions:
  pinsrw $0, %esi, %xmm0
  pinsrw $1, %edx, %xmm0
  pinsrw $2, %ecx, %xmm0
  pinsrw $3, %r8d, %xmm0
  pinsrw $4, %r9d, %xmm0
  pinsrw $5, 8(%rsp), %xmm0
  pinsrw $6, 16(%rsp), %xmm0
  pinsrw $7, 24(%rsp), %xmm0
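This per-element insertion strategy is the same one exposed at the source level by the SSE2 _mm_insert_epi16 intrinsic, where each call maps to a single pinsrw with an immediate lane index. A minimal sketch, purely for illustration (the patch does this in the DAG lowering, not via intrinsics):

  #include <emmintrin.h>

  // Each _mm_insert_epi16 corresponds to one pinsrw $imm, src, %xmm.
  __m128i build_pinsrw(short a, short b, short c, short d,
                       short e, short f, short g, short h) {
    __m128i v = _mm_setzero_si128();
    v = _mm_insert_epi16(v, a, 0);
    v = _mm_insert_epi16(v, b, 1);
    v = _mm_insert_epi16(v, c, 2);
    v = _mm_insert_epi16(v, d, 3);
    v = _mm_insert_epi16(v, e, 4);
    v = _mm_insert_epi16(v, f, 5);
    v = _mm_insert_epi16(v, g, 6);
    v = _mm_insert_epi16(v, h, 7);
    return v;
  }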
This patch adds this change, and it also first looks for an opportunity to transform the build into a SHUFFLE + VEC_INSERT_ELTS sequence before falling back to pinsrw. Most of this patch is about moving some functions around; that part will come in a separate commit.