This is an archive of the discontinued LLVM Phabricator instance.

[X86][SSE] Use PSLLDQ/PSRLDQ to mask out zeroable ends of a shuffle
ClosedPublic

Authored by RKSimon on Jan 16 2019, 7:25 AM.

Details

Summary

As suggested on PR40318, this patch uses PSLLDQ/PSRLDQ to lower shuffles to zero out the ends of a vector, leaving a sequential inner section.

For pre-SSSE3 we do this for shuffles with zeros at either end (requiring up to 3 shifts), but once PSHUFB is available I've limited this to shuffles with a single zeroable end (2 shifts).

Diff Detail

Repository
rL LLVM

Event Timeline

RKSimon created this revision.Jan 16 2019, 7:25 AM

The double-shift cases look good, but I'm skeptical about the triple-shift. Wouldn't those always be better with an 'and' mask followed by shift? We reduce the dependent chain of vector ops and instruction count for the cost of a speculatable constant pool load.

The double-shift cases look good, but I'm skeptical about the triple-shift. Wouldn't those always be better with an 'and' mask followed by shift? We reduce the dependent chain of vector ops and instruction count for the cost of a speculatable constant pool load.

I did consider that but then we contradict the "3 op limit" for older machines (like pre-SSSE3) before using "variable" shuffle masks - which includes AND masks.

The double-shift cases look good, but I'm skeptical about the triple-shift. Wouldn't those always be better with an 'and' mask followed by shift? We reduce the dependent chain of vector ops and instruction count for the cost of a speculatable constant pool load.

I did consider that but then we contradict the "3 op limit" for older machines (like pre-SSSE3) before using "variable" shuffle masks - which includes AND masks.

Ah, so we would expect an even later transform (combineX86ShufflesRecursively?) to squash that. Worth adding a TODO comment about that? Or maybe nobody cares about pre-SSSE3 perf that much to bother.

lib/Target/X86/X86ISelLowering.cpp
10939–10940

The pair part of the comment is over-specific for the top-level - move it below where we have the example sequences?

10960–10963

Could we do the simpler check/assert that V2 has been canonicalized to a zero constant?

lebedev.ri added inline comments.Jan 24 2019, 12:31 AM
test/CodeGen/X86/buildvec-extract.ll
408

I'm having trouble reading this pretty-print.
Shouldn't it be something more like

psrldq {{.*#+}} xmm0 = zero,zero,zero,zero,zero,zero,xmm0[14,15],zero,zero,zero,zero,zero,zero,zero,zero

?
Otherwise to me it reads as-if it wasn't zext i16 to 16, but zext i16 to i64 + shl (64-16)
(i.e. zeros are not in MSB, but LSB)

craig.topper added inline comments.Jan 24 2019, 2:30 PM
test/CodeGen/X86/buildvec-extract.ll
408

The pretty printer prints LSB first. So the pslldq is putting bytes 0, 1, 2, and 3 of the original vector in the MSBs. Then the pslrdq takes bytes 14 and 15 from that which are really bytes 2 and 3 of the input and moves them to byte 0 and 1 of the output.

RKSimon marked an inline comment as done.Jan 29 2019, 4:27 AM
RKSimon added inline comments.
test/CodeGen/X86/buildvec-extract.ll
408

The joys of x86 idiosyncrasies..... This is a shuffle asm comment and that is the shuffle that is created.

RKSimon marked an inline comment as done.Jan 29 2019, 5:36 AM
RKSimon added inline comments.
lib/Target/X86/X86ISelLowering.cpp
10960–10963

Not easily, our shuffle mask canonicalization doesn't try to account for zero ops as it regresses shuffles in cases where we can't make use of it being zero.

RKSimon updated this revision to Diff 184065.Jan 29 2019, 5:38 AM

rebased and updated comments

Herald added a project: Restricted Project. · View Herald TranscriptFeb 1 2019, 1:26 AM
spatel accepted this revision.Feb 1 2019, 6:09 AM

LGTM - but I think we should have a bug report and/or a TODO comment somewhere about the triple-shift if it's not there already.

This revision is now accepted and ready to land.Feb 1 2019, 6:09 AM
This revision was automatically updated to reflect the committed changes.