This is an archive of the discontinued LLVM Phabricator instance.

[X86][SSE] Avoid shuffles with zero by using pshufb to create zeros
ClosedPublic

Authored by RKSimon on Jan 8 2015, 3:56 AM.

Details

Summary

pshufb can write zero bytes as well as bytes taken from a source vector. We can use this to avoid shuffling two vectors and ORing the results when every element used from one of the vectors is zeroable.
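As an illustration of the summary, here is a scalar model of the 128-bit pshufb (a control byte with bit 7 set produces a zero byte, otherwise its low 4 bits index the source). It is only a sketch and not the code in this patch: the helper names (emulatePshufb, emulatePor, twoShuffleLowering, singleShuffleLowering) and the chosen interleave-with-zero mask are assumptions made for the example.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Bytes = std::array<std::uint8_t, 16>;

// Scalar model of 128-bit pshufb: bit 7 of a control byte forces a zero,
// otherwise the low 4 bits select a byte from src.
Bytes emulatePshufb(const Bytes &src, const Bytes &ctrl) {
  Bytes out{};
  for (int i = 0; i < 16; ++i)
    out[i] = (ctrl[i] & 0x80) ? 0 : src[ctrl[i] & 0x0F];
  return out;
}

// Bytewise OR, modelling por.
Bytes emulatePor(const Bytes &a, const Bytes &b) {
  Bytes out{};
  for (int i = 0; i < 16; ++i)
    out[i] = static_cast<std::uint8_t>(a[i] | b[i]);
  return out;
}

// Control bytes that put src[0..7] into the even result bytes and zero the
// odd ones (takeEven == true), or the reverse (takeEven == false).
Bytes laneControl(bool takeEven) {
  Bytes ctrl{};
  for (int i = 0; i < 16; ++i) {
    bool take = ((i % 2 == 0) == takeEven);
    ctrl[i] = static_cast<std::uint8_t>(take ? i / 2 : 0x80);
  }
  return ctrl;
}

// Two-input form: shuffle each vector separately, then OR the results.
Bytes twoShuffleLowering(const Bytes &v1, const Bytes &v2) {
  return emulatePor(emulatePshufb(v1, laneControl(true)),
                    emulatePshufb(v2, laneControl(false)));
}

// Single-input form: valid when every byte taken from v2 is known to be zero,
// because the 0x80 control bytes already produce those zeros.
Bytes singleShuffleLowering(const Bytes &v1) {
  return emulatePshufb(v1, laneControl(true));
}
```

When the second input is an all-zero vector, both forms produce the same bytes, so the second pshufb and the por contribute nothing.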

Diff Detail

Repository
rL LLVM

Event Timeline

RKSimon updated this revision to Diff 17891. Jan 8 2015, 3:56 AM
RKSimon retitled this revision to [X86][SSE] Avoid shuffles with zero by using pshufb to create zeros.
RKSimon updated this object.
RKSimon edited the test plan for this revision.
RKSimon added reviewers: chandlerc, spatel, andreadb.
RKSimon set the repository for this revision to rL LLVM.
RKSimon added a subscriber: Unknown Object (MLST).

Hi Simon,

Looks mostly good to me. I just have one concern: I find the current structure error-prone if we ever need to update it.
See my inline comments.

Thanks,
-Quentin

lib/Target/X86/X86ISelLowering.cpp
9602 ↗(On Diff #17891)

This is just a suggestion.
How about moving the zeroable test outside of the ‘?:’ operator?
I.e., first compute the indices, then override them:

  int V1Idx = ((Mask[i] < 16) ? Mask[i] : 0x80);
  int V2Idx = ((Mask[i] < 16) ? 0x80 : Mask[i] - 16);
  if (Zeroable[i])
    V1Idx = V2Idx = 0x80;

9605 ↗(On Diff #17891)

I would introduce a named constant for 0x80 instead of repeating the literal.

9606 ↗(On Diff #17891)

We already know this from the ‘?:’ statements.
It seems worth restructuring the code to actually use an if/else.
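Taken together, the three comments above (hoist the zeroable test, name the 0x80 constant, use an if/else) could look roughly like the sketch below. This is not the committed code: kZeroByte, PshufbMasks and buildPshufbMasks are hypothetical names, and the sketch assumes every mask entry is in the range 0..31 (no undef entries).

```cpp
#include <array>
#include <cassert>

// Hypothetical name for the pshufb control value that writes a zero byte.
constexpr int kZeroByte = 0x80;

// Hypothetical result type: one control mask per input, plus usage flags.
struct PshufbMasks {
  std::array<int, 16> v1{};
  std::array<int, 16> v2{};
  bool v1InUse = false;
  bool v2InUse = false;
};

// Assumes mask[i] is in 0..31: 0..15 selects from V1, 16..31 from V2.
PshufbMasks buildPshufbMasks(const std::array<int, 16> &mask,
                             const std::array<bool, 16> &zeroable) {
  PshufbMasks result;
  for (int i = 0; i < 16; ++i) {
    int v1Idx = kZeroByte;
    int v2Idx = kZeroByte;
    if (zeroable[i]) {
      // Both inputs contribute a zero byte; neither counts as used.
    } else if (mask[i] < 16) {
      v1Idx = mask[i];
      result.v1InUse = true;
    } else {
      v2Idx = mask[i] - 16;
      result.v2InUse = true;
    }
    result.v1[i] = v1Idx;
    result.v2[i] = v2Idx;
  }
  return result;
}
```

With this shape the "in use" flags fall out of the same branches that pick the indices, so there is no separate check to keep in sync.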

9611 ↗(On Diff #17891)

I would structure this and the following if a bit differently.
But that is a matter of taste.

Currently we have:

  if (A) {
    // do1
    if (!B)
      return A
  }

  if (B) {
    // do2
    if (!A)
      return B
    return // do3
  }

I would do =>

  if (A)
    // do1

  if (B)
    // do2

  if (A && B)
    return // do3

  return A ? A : B;
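To check that the two shapes agree, here is a small sketch with placeholder steps. Everything in it is hypothetical: do1, do2 and do3 stand in for the real lowering steps, ints stand in for the shuffled vectors, and it assumes at least one of the two inputs is in use.

```cpp
#include <cassert>

// Placeholder steps; the arithmetic only makes the results distinguishable.
int do1(int a) { return a + 100; }
int do2(int b) { return b + 200; }
int do3(int a, int b) { return a | b; }

// Shape of the code as currently written in the diff.
int currentShape(bool useA, bool useB, int a, int b) {
  if (useA) {
    a = do1(a);
    if (!useB)
      return a;
  }
  if (useB) {
    b = do2(b);
    if (!useA)
      return b;
    return do3(a, b);
  }
  return 0; // Neither input in use; assumed not to happen in this sketch.
}

// Suggested shape: run each step independently, then pick the result once.
int proposedShape(bool useA, bool useB, int a, int b) {
  if (useA)
    a = do1(a);
  if (useB)
    b = do2(b);
  if (useA && useB)
    return do3(a, b);
  return useA ? a : b;
}
```

The two functions return the same value whenever at least one input is in use. They differ only when neither is, which the sketch assumes does not occur.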

Forgot to ask: do you have any performance numbers for this?

Q.

This revision was automatically updated to reflect the committed changes.

Thanks Quentin.

A basic timing test of the core loop (a single pshufb vs. 2x pshufb + por) gave a 30% improvement on my older Core2Duo machine (I guess due to throughput limitations), but this diminished to less than 5% on SandyBridge. However, the main benefit is reduced register pressure, along with no longer pointlessly shuffling zero vectors.