Further perf tests on Jaguar indicate that:
vxorps %ymm0, %ymm0, %ymm0 vcmpps $15, %ymm0, %ymm0, %ymm0
is consistently faster (by about 9%) than:
vpcmpeqd %xmm0, %xmm0, %xmm0 vinsertf128 $1, %xmm0, %ymm0, %ymm0
Testing equivalent code on a SandyBridge (E5-2640) puts it slightly (~3%) faster as well.
OK to commit, this change is only related to AVX1?
Should this be "OptForMinSize"? Ie, I was just looking at some other code patterns, and we are inserting xorps/xorpd even at -Os.
In fact, we're doing that even at -Oz which I think is a mistake.