This patch introduces a new DAGCombiner rule to simplify concat_vectors nodes:
concat_vectors( bitcast (scalar_to_vector %A), UNDEF) --> bitcast (scalar_to_vector %A)
This patch only partially addresses PR39257. In particular, it is enough to fix one of the two problematic cases mentioned in PR39257. However, it is not enough to fix the original test case from Craig in PR39257; that particular case would probably require a more complicated approach (and knowledge about used bits).
Before this patch, we used to generate the following code for function PR39257 (-mtriple=x86_64 , -mattr=+avx):
vmovsd (%rdi), %xmm0 # xmm0 = mem[0],zero vxorps %xmm1, %xmm1, %xmm1 vblendps $3, %xmm0, %xmm1, %xmm0 # xmm0 = xmm0[0,1],xmm1[2,3] vmovaps %ymm0, (%rsi) vzeroupper retq
Now we generate this:
vmovsd (%rdi), %xmm0 # xmm0 = mem[0],zero vmovaps %ymm0, (%rsi) vzeroupper retq
As a side note: that VZEROUPPER is completely redundant...
I guess the vzeroupper insertion pass doesn't realize that the definition of %xmm0 from vmovsd is already zeroing the upper half of %ymm0. Note that on -mcpu=btver2, we don't get that vzeroupper because pass vzeroupper insertion pass is disabled.
Use peekThroughOneUseBitcasts ?