Page MenuHomePhabricator

[X86][SSE] Match zero/any extension shuffles that don't start from the first element
ClosedPublic

Authored by RKSimon on Sep 2 2015, 10:17 AM.

Details

Summary

This patch generalizes the lowering of shuffles as zero extensions to allow extensions that don't start from the first element. It now recognises extensions starting anywhere in the lower 128-bits or at the start of any higher 128-bit lane.

The motivation was to reduce the number of high cost pshufb calls, but it also improves the SSE2 case as well.

Diff Detail

Repository
rL LLVM

Event Timeline

RKSimon updated this revision to Diff 33821.Sep 2 2015, 10:17 AM
RKSimon retitled this revision from to [X86][SSE] Match zero/any extension shuffles that don't start from the first element.
RKSimon updated this object.
RKSimon added reviewers: chandlerc, qcolombet, andreadb.
RKSimon set the repository for this revision to rL LLVM.
RKSimon added a subscriber: llvm-commits.
RKSimon updated this revision to Diff 34164.Sep 7 2015, 9:17 AM

Updated SSE2 zero extension by UNPCK - only shuffle odd offsets by a single lane to allow extension to multiple 128-bit vectors to share shuffles.

RKSimon updated this revision to Diff 34649.Sep 13 2015, 8:57 AM

Lower with pmovzx for offset and scale=2 if its for a vector size greater than 128-bits.

qcolombet edited edge metadata.Sep 17 2015, 10:51 AM

Hi Simon,

This looks mostly good to me, though it seems a couple of tests are regressing, aren’t they?

I have annotated them, could you comment on why this is beneficial or at least neutral?

Thanks,
-Quentin

lib/Target/X86/X86ISelLowering.cpp
7359 ↗(On Diff #34649)

NumEltsPerLane?

7365 ↗(On Diff #34649)

Document those preconditions on the function prototype as well.

7369 ↗(On Diff #34649)

I wonder if this lambda makes sense.
Unless I am mistaken, this value is an invariant of the lambda and can be hoisted outside of the only loop where it is used.
I.e., we can just make the test: OffsetLane == Idx/NumLaneElements at the right place.

7390 ↗(On Diff #34649)

Could you explain why in the comment?

7555 ↗(On Diff #34649)

What is “-ve"?

7576 ↗(On Diff #34649)

Explain why, I am sure it won’t be obvious when we look at that line in a couple of months :).

test/CodeGen/X86/vec_cast2.ll
61 ↗(On Diff #34649)

Isn’t the new sequence less efficient than the previous one?

test/CodeGen/X86/vector-shuffle-256-v32.ll
1759 ↗(On Diff #34649)

Ditto?

1777 ↗(On Diff #34649)

Ditto?

test/CodeGen/X86/vector-zext.ll
134 ↗(On Diff #34649)

Ditto?

212 ↗(On Diff #34649)

Ditto?

RKSimon updated this revision to Diff 35090.Sep 18 2015, 8:45 AM
RKSimon edited edge metadata.

Updated based on feedback

RKSimon marked 5 inline comments as done.Sep 18 2015, 9:01 AM

Regarding the tests - the over use of PSHUFB can be a performance issue - you often have to load the shuffle mask (>5cy) and many targets (AMD + Intel) have a poor latency/throughput executing it (>3cy). In comparison PSHUFD/PSRLDQ as well as PMOVZX are pretty light instructions (1-2cy each).

lib/Target/X86/X86ISelLowering.cpp
7359 ↗(On Diff #35090)

I've pulled the 'OffsetLane' constant out - I'd prefer to keep the lambda though as its used in quite a few places and avoids cluttering the code.

7378 ↗(On Diff #35090)

This is definitely a borderline case - without the earlyout most of the time we are just replacing a XOR (zero)+PUNCKH with a PSHUFD+PMOVZX. There's next to nothing in it so I went with avoiding a change.

qcolombet accepted this revision.Sep 21 2015, 11:04 AM
qcolombet edited edge metadata.

Regarding the tests - the over use of PSHUFB can be a performance issue - you often have to load the shuffle mask (>5cy) and many targets (AMD + Intel) have a poor latency/throughput executing it (>icy).

Ah, that’s the catch, loading the mask! From the diff that wasn’t clear since we do not see the actual operands.

Alright, LGTM then!

Thanks,
-Quentin

This revision is now accepted and ready to land.Sep 21 2015, 11:04 AM
This revision was automatically updated to reflect the committed changes.
RKSimon marked an inline comment as done.