Since revision 248784 the following code generates a redundant vmov:
m128 test1(m128i const *src) {
return _mm_cvtph_ps(_mm_loadl_epi64(src));
}
Output:
vmovq (%rdi), %xmm0 # xmm0 = mem[0],zero
vcvtph2ps %xmm0, %xmm0
retq
The regression was caused by a change made in revision 247504 which teaches the instruction combiner that only the lower 64 bits of a 128-bit vcvtph2ps are used. The IR for the above code is:
define <4 x float> @_Z4testPKDv2_x(<2 x i64>* nocapture readonly %src) #0 {
entry:
%__u.i = getelementptr inbounds <2 x i64>, <2 x i64>* %src, i64 0, i64 0 %0 = load i64, i64* %__u.i, align 1, !tbaa !1 %vecinit.i = insertelement <2 x i64> undef, i64 %0, i32 0 %vecinit1.i = insertelement <2 x i64> %vecinit.i, i64 0, i32 1 %1 = bitcast <2 x i64> %vecinit1.i to <8 x i16> %2 = call <4 x float> @llvm.x86.vcvtph2ps.128(<8 x i16> %1) #2 ret <4 x float> %2
}
After r247504 the instruction combiner can see that the second insertelement is redundant and it is deleted. While this is correct, it interferes with the isel patterns that select the memory-register form of VCVTPH2PS. Previously, the load followed by the two insertelements would have been recognized by the pattern fragment 'vzmovl_v2i64'. This would then have selected a VCVTPH2PSrm via the following pattern:
// Pattern match vcvtph2ps of a scalar i64 load. def : Pat<(int_x86_vcvtph2ps_128 (vzmovl_v2i64 addr:$src)), (VCVTPH2PSrm addr:$src)>;
However, after r247504 we only have a single insertelement and the vzmovl_v2i64 is no longer matched. The reason the problem only occurs after r248784 is that prior to this the combiner was unable to peek through the bitcast operation.