As noted in PR23073 (https://llvm.org/bugs/show_bug.cgi?id=23073),
for code like this:
define <8 x i32> @load_v8i32() {
  ret <8 x i32> <i32 7, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0>
}
We produce this AVX code:
_load_v8i32:                            ## @load_v8i32
	movl	$7, %eax
	vmovd	%eax, %xmm0
	vxorps	%ymm1, %ymm1, %ymm1
	vblendps	$1, %ymm0, %ymm1, %ymm0 ## ymm0 = ymm0[0],ymm1[1,2,3,4,5,6,7]
	retq
There are at least 2 bugs in play here:
- We're generating a blend when a scalar move would do the same job using two fewer instruction bytes.
- We're not matching an existing pattern that would eliminate the xor and blend entirely: the zeroing of the upper bytes is free with vmovd.
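With both issues fixed, the lowering should need only the scalar load of the constant; the rest of the vector is zeroed implicitly. A sketch of the expected output (not verified compiler output):

	movl	$7, %eax
	vmovd	%eax, %xmm0     ## vmovd zeroes the remaining bytes of the register
	retq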
The second fix involves an adjustment of "AddedComplexity" [1] and masks the first problem, but I went ahead with a partial fix for that problem in case we ever match that pattern some other way. I'm not sure how to trigger it, so I don't have an additional test case for it. I'll address the remaining FIXMEs if nobody sees any problems with this patch.
[1] AddedComplexity has close to no documentation in the source. The best we have is this comment: "roughly corresponds to the number of nodes that are covered". It appears that x86 has bastardized this definition by inflating its values for some other undocumented reason. For example, we have a pattern with "AddedComplexity = 400" (!). I searched my way to this page:
https://groups.google.com/forum/#!topic/llvm-dev/5UX-Og9M0xQ
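For context, AddedComplexity is set on TableGen patterns to bias instruction selection toward them when multiple patterns match. A purely illustrative sketch (hypothetical pattern, not the actual x86 definition):

// Hypothetical sketch: a higher AddedComplexity makes the selector
// prefer this pattern over lower-scored matches for the same node.
let AddedComplexity = 400 in
  def : Pat<(some_node addr:$src),
            (SOME_INSTR addr:$src)>;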