This is an archive of the discontinued LLVM Phabricator instance.

[X86][SSE] Improved (v)insertps shuffle matching
ClosedPublic

Authored by RKSimon on Jan 8 2015, 10:17 AM.

Details

Summary

In the current code we only attempt to match against insertps if we have exactly one element from the second input vector, irrespective of how much of the shuffle result is zeroable.

This patch checks to see if there is a single non-zeroable element from either input that requires insertion. It also supports matching of cases where only one of the inputs need to be referenced.

We also split (v)insertps shuffle matching off into a new lowerVectorShuffleAsInsertPS function.

Diff Detail

Repository
rL LLVM

Event Timeline

RKSimon updated this revision to Diff 17902.Jan 8 2015, 10:17 AM
RKSimon retitled this revision from to [X86][SSE] Improved (v)insertps shuffle matching.
RKSimon updated this object.
RKSimon edited the test plan for this revision. (Show Details)
RKSimon added reviewers: chandlerc, andreadb, spatel.
RKSimon set the repository for this revision to rL LLVM.
RKSimon added a subscriber: Unknown Object (MLST).
qcolombet accepted this revision.Jan 8 2015, 4:07 PM
qcolombet added a reviewer: qcolombet.
qcolombet added a subscriber: qcolombet.

Hi Simon,

LGTM with more comments to ease futur modifications.
See the inlined comments.

Any performances numbers related to this?

Thanks,
-Quentin

lib/Target/X86/X86ISelLowering.cpp
8178 ↗(On Diff #17902)

We can use a bool for that and I would also change the name to something like IsV1Used.

8205 ↗(On Diff #17902)

Add a comment saying that if we get here, V2 is not used but one element of V1 is moved.
Therefore, V1 will be used as inplace element (first operand) and as the source of insertion (second operand).

I think that would make clearer why V2 gets V1’s information and ease future reading/modifications.

8209 ↗(On Diff #17902)

Add a comment saying that unlike LLVM, the index is from the start of the current operand, not the start of the concatenated vector. I.e., indexes in {V1[4] V2[4]} targeting V2 are equals to original index - numberOfElts(V1).

8212 ↗(On Diff #17902)

Add a comment saying that V1 is not used, which means it is zeroable.

8214 ↗(On Diff #17902)

No curly brackets.

8536 ↗(On Diff #17902)

We can remove this block with the previous one to get rid of one if.

This revision is now accepted and ready to land.Jan 8 2015, 4:07 PM
qcolombet added inline comments.Jan 8 2015, 4:10 PM
lib/Target/X86/X86ISelLowering.cpp
8536 ↗(On Diff #17902)

s/remove/merge/

chandlerc edited edge metadata.Jan 8 2015, 4:16 PM

Additional nits from Quentin's comments. Please try to work on some of the coding style issues.

lib/Target/X86/X86ISelLowering.cpp
8182–8197 ↗(On Diff #17902)

I've made this comment is several review threads. *please* use continue rather than long if-else chains in loops.

8190–8191 ↗(On Diff #17902)

assymetric braces are really werid.

8222 ↗(On Diff #17902)

Indent is wrong here. Again, please use clang-format.

This revision was automatically updated to reflect the committed changes.

Thanks for the feedback guys - apologies for the code style problems, they should be fixed now.

Regarding performance - its tricky to give specific numbers as insertps gets matched against a wide variety of masks, but if we assume that an insertps instruction replaces a xorps (zero) and 2 dependant shufps, on a brief test I'm seeing a 35% boost on Core2Duo.