This is an archive of the discontinued LLVM Phabricator instance.

[X86] Improved lowering of v4x32 build_vector dag nodes.
ClosedPublic

Authored by andreadb on Nov 18 2014, 11:32 AM.

Details

Summary

Hi Quentin, Nadav (and all),

This patch improves the lowering of v4f32 and v4i32 build_vector dag nodes to blend/insertps.
In particular, this patch improves function 'LowerBuildVectorv4x32' which works under the following preconditions:

  • the build_vector in input is not a build_vector of all-zeros;
  • the build_vector in input has at least one non-zero element.

This patch improves the previous behavior as follows:

  1. A build_vector that performs a blend with a zero vector is converted to a shuffle.
  2. We now identify more opportunities to lower a build_vector into an insertps with zero masking.

About 1), this is to let the shuffle legalizer expand the dag node in a optimal way. In particular, this helps improving the codegen in cases where an insertps is selected instead of a movq or a blend (See the differences in test sse41.ll and sse2.ll).

About 2), we now get much better codegen in all the new test cases added in sse41.ll.

For example:
;;
define <4 x float> @insertps_7(<4 x float> %A, <4 x float> %B) #0 {
entry:

%vecext = extractelement <4 x float> %A, i32 0
%vecinit = insertelement <4 x float> undef, float %vecext, i32 0
%vecinit1 = insertelement <4 x float> %vecinit, float 0.000000e+00, i32 1
%vecext2 = extractelement <4 x float> %B, i32 1
%vecinit3 = insertelement <4 x float> %vecinit1, float %vecext2, i32 2
%vecinit4 = insertelement <4 x float> %vecinit3, float 0.000000e+00, i32 3
ret <4 x float> %vecinit4

}
;;

Before the backend generated the following assembly:

shufps $-27, %xmm1, %xmm1
xorps %xmm2, %xmm2
blendps $14, %xmm2, %xmm0
blendps $14, %xmm2, %xmm1
unpcklpd %xmm1, %xmm0
retq

with this patch, the backend correctly lowers the build_vector to insertps:

insertps $170, %xmm1, %xmm0 # xmm0 = xmm0[0],zero,xmm1[1],zero
retq

Please let me know if ok to submit.
Thanks,
Andrea

Diff Detail

Repository
rL LLVM

Event Timeline

andreadb updated this revision to Diff 16345.Nov 18 2014, 11:32 AM
andreadb retitled this revision from to [X86] Improved lowering of v4x32 build_vector dag nodes..
andreadb updated this object.
andreadb edited the test plan for this revision. (Show Details)
andreadb added reviewers: qcolombet, nadav, grosbach, delena.
andreadb added a subscriber: Unknown Object (MLST).
andreadb added inline comments.Nov 18 2014, 12:59 PM
lib/Target/X86/X86ISelLowering.cpp
5752–5754 ↗(On Diff #16345)

I just realized that this check could be improved.
This assertion should check that the number of non-zero elements is strictly bigger than 1 (and not >= 1). The reason why it cannot be 1 is because build_vector nodes of i32 or f32 elements that only have one non-zero element are expanded earlier before we call this function.
I will correct it before sending.

andreadb updated this revision to Diff 16378.Nov 19 2014, 4:41 AM

Uploaded a new version of the patch.
This time we correctly check the precondition on the number of non-zero elements in input to the build_vector.
Also, added a missing check on the value type of vectors in input to extract_vector_elt dag nodes.

qcolombet accepted this revision.Nov 19 2014, 10:49 AM
qcolombet edited edge metadata.

Hi Andrea,

LGTM.

Thanks,
-Quentin

lib/Target/X86/X86ISelLowering.cpp
5784 ↗(On Diff #16378)

Just add a comment that the zero vector will be on the RHS, that’s why it is EltIdx + 4.

5789 ↗(On Diff #16378)

Add a comment that by construction Elt is a EXTRACT_VECTOR_ELT with constant index.

This revision is now accepted and ready to land.Nov 19 2014, 10:49 AM
Diffusion closed this revision.Nov 19 2014, 11:35 AM
Diffusion updated this revision to Diff 16394.

Closed by commit rL222375 (authored by adibiagio).

Thanks Quentin!

I added the extra comments in the code.
Committed revision 222375.