Hi Quentin, Nadav (and all),
This patch improves the lowering of v4f32 and v4i32 build_vector dag nodes to blend/insertps.
In particular, this patch improves the function 'LowerBuildVectorv4x32', which works under the following preconditions:
- the build_vector in input is not a build_vector of all-zeros;
- the build_vector in input has at least one non-zero element.
This patch improves the previous behavior as follows:
1) A build_vector that performs a blend with a zero vector is converted to a shuffle.
2) We now identify more opportunities to lower a build_vector into an insertps with zero masking.
About 1), this is to let the shuffle legalizer expand the dag node in an optimal way. In particular, this helps improve the codegen in cases where an insertps is selected instead of a movq or a blend (see the differences in tests sse41.ll and sse2.ll).
About 2), we now get much better codegen in all the new test cases added in sse41.ll.
For example:
;;
define <4 x float> @insertps_7(<4 x float> %A, <4 x float> %B) #0 {
entry:
  %vecext = extractelement <4 x float> %A, i32 0
  %vecinit = insertelement <4 x float> undef, float %vecext, i32 0
  %vecinit1 = insertelement <4 x float> %vecinit, float 0.000000e+00, i32 1
  %vecext2 = extractelement <4 x float> %B, i32 1
  %vecinit3 = insertelement <4 x float> %vecinit1, float %vecext2, i32 2
  %vecinit4 = insertelement <4 x float> %vecinit3, float 0.000000e+00, i32 3
  ret <4 x float> %vecinit4
}
;;
Before this patch, the backend generated the following assembly:
  shufps $-27, %xmm1, %xmm1
  xorps %xmm2, %xmm2
  blendps $14, %xmm2, %xmm0
  blendps $14, %xmm2, %xmm1
  unpcklpd %xmm1, %xmm0
  retq
With this patch, the backend correctly lowers the build_vector to a single insertps:
  insertps $170, %xmm1, %xmm0 # xmm0 = xmm0[0],zero,xmm1[1],zero
  retq
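To make the equivalence concrete, here is a minimal scalar sketch of what a register-to-register insertps with zero masking computes, checked against what the IR for @insertps_7 builds. The names insertpsModel and buildVectorReference are hypothetical and are not part of the patch; the source index, destination index, and zero mask are passed as separate arguments rather than packed into the immediate.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using V4 = std::array<float, 4>;

// Hypothetical scalar model of a register-to-register insertps.
//   srcIdx: which element of 'src' is read
//   dstIdx: which lane of 'dst' is overwritten with that element
//   zmask : bit i set means lane i of the result is forced to zero
V4 insertpsModel(V4 dst, const V4 &src, unsigned srcIdx, unsigned dstIdx,
                 std::uint8_t zmask) {
  assert(srcIdx < 4 && dstIdx < 4 && zmask < 16);
  dst[dstIdx] = src[srcIdx];          // insert the selected element
  for (unsigned i = 0; i < 4; ++i)    // then apply the zero mask
    if (zmask & (1u << i))
      dst[i] = 0.0f;
  return dst;
}

// What the build_vector in @insertps_7 produces: <A[0], 0.0, B[1], 0.0>.
V4 buildVectorReference(const V4 &A, const V4 &B) {
  return {A[0], 0.0f, B[1], 0.0f};
}
```

With A as the destination, B as the source, element 1 of B inserted into lane 2, and lanes 1 and 3 zeroed (zmask 0xA), the model yields the same vector as the build_vector, which is why one instruction is enough here.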
Please let me know if ok to submit.
Thanks,
Andrea
I just realized that this check could be improved.
This assertion should check that the number of non-zero elements is strictly greater than 1 (and not >= 1). It cannot be 1 because build_vector nodes of i32 or f32 elements that have only one non-zero element are expanded earlier, before we call this function.
I will correct it before sending.
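As a rough illustration of the tightened check, the sketch below counts the non-zero operands of a four-element build_vector and tests the stricter condition. The Elt enum and both function names are hypothetical stand-ins, not the actual code in the patch, and undef operands are deliberately left out of the model.

```cpp
#include <array>
#include <cassert>

// Hypothetical classification of a build_vector operand.
enum class Elt { Zero, NonZero };

using Operands = std::array<Elt, 4>;

// Count how many of the four operands are not known to be zero.
unsigned countNonZero(const Operands &ops) {
  unsigned n = 0;
  for (Elt e : ops)
    if (e == Elt::NonZero)
      ++n;
  return n;
}

// The corrected precondition: strictly more than one non-zero element.
// The single non-zero case is assumed to have been expanded earlier.
bool meetsPrecondition(const Operands &ops) {
  return countNonZero(ops) > 1;
}
```

Under this model the operands of @insertps_7 (non-zero, zero, non-zero, zero) satisfy the check, while a vector with a single non-zero element does not.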