This patch fixes the poor codegen seen in PR21710 (http://llvm.org/bugs/show_bug.cgi?id=21710). Before we crack 32-byte build vectors into smaller chunks (and subsequently glue them back together), we should look for the easy case where we can just load all of the elements in a single op.
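Roughly, the new check looks like this. This is only a sketch of the idea, not the committed code: the helper name is hypothetical, and a real implementation must also splice the merged load's chain into the users of the original loads and verify alignment/legality.

  #include "llvm/CodeGen/SelectionDAG.h"
  using namespace llvm;

  // Sketch: before 256-bit build_vector lowering cracks the node into
  // two 128-bit halves, check whether every operand is a consecutive,
  // non-volatile scalar load from a common base; if so, emit one wide load.
  static SDValue mergeConsecutiveLoads(SDNode *N, SelectionDAG &DAG) {
    EVT VT = N->getValueType(0);
    unsigned NumElts = VT.getVectorNumElements();
    unsigned EltBytes = VT.getScalarSizeInBits() / 8;

    // Element 0 anchors the expected address sequence.
    auto *Base = dyn_cast<LoadSDNode>(N->getOperand(0));
    if (!Base || !ISD::isNormalLoad(Base) || Base->isVolatile())
      return SDValue();

    // Every other element must load from Base + i * EltBytes.
    for (unsigned i = 1; i != NumElts; ++i) {
      auto *Ld = dyn_cast<LoadSDNode>(N->getOperand(i));
      if (!Ld || !ISD::isNormalLoad(Ld) ||
          !DAG.areNonVolatileConsecutiveLoads(Ld, Base, EltBytes, i))
        return SDValue();
    }

    // All elements are consecutive in memory: one 32-byte load suffices.
    return DAG.getLoad(VT, SDLoc(N), Base->getChain(), Base->getBasePtr(),
                       Base->getPointerInfo());
  }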
The codegen change for the latter two test cases (derived from the bug report examples) is:
vmovss 16(%rdi), %xmm1
vmovups (%rdi), %xmm0
vinsertps $16, 20(%rdi), %xmm1, %xmm1
vinsertps $32, 24(%rdi), %xmm1, %xmm1
vinsertps $48, 28(%rdi), %xmm1, %xmm1
vinsertf128 $1, %xmm1, %ymm0, %ymm0
retq
To:
vmovups (%rdi), %ymm0
retq
And:
vmovsd 16(%rdi), %xmm1
vmovupd (%rdi), %xmm0
vmovhpd 24(%rdi), %xmm1, %xmm1
vinsertf128 $1, %xmm1, %ymm0, %ymm0
retq
To:
vmovups (%rdi), %ymm0
retq
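For context, reductions of roughly this shape produce the build vectors above and should now compile (at -O2 with -mavx) to the single 32-byte loads shown. This is hypothetical source using Clang/GCC vector extensions; the actual bug-report code may differ.

  typedef float v8f32 __attribute__((vector_size(32)));
  typedef double v4f64 __attribute__((vector_size(32)));

  // Eight consecutive float loads -> one <8 x float> build vector.
  v8f32 load_v8f32(const float *p) {
    v8f32 v = { p[0], p[1], p[2], p[3], p[4], p[5], p[6], p[7] };
    return v;
  }

  // Four consecutive double loads -> one <4 x double> build vector.
  v4f64 load_v4f64(const double *p) {
    v4f64 v = { p[0], p[1], p[2], p[3] };
    return v;
  }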
I think it's benign that we generate 'vmovups' rather than 'vmovupd' in that second case because the load result is not otherwise used here. I confirmed that we select a double-precision instruction if the load result is actually used in the function.
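For example (a hypothetical function, not one from the test suite), adding a use of the loaded <4 x double> value should keep the code in the double domain:

  typedef double v4f64 __attribute__((vector_size(32)));

  v4f64 load_and_use_v4f64(const double *p) {
    v4f64 v = { p[0], p[1], p[2], p[3] };
    return v + v;  // the use selects a double instruction (e.g. vaddpd)
  }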
I've also updated the existing load merge test to use FileCheck and added a v4f32 test for completeness.