This is an archive of the discontinued LLVM Phabricator instance.

[X86][SSE] Avoid scalarization of v2i64 vector shifts
ClosedPublic

Authored by RKSimon on Mar 18 2015, 8:09 AM.

Details

Summary

Currently v2i64 vectors shifts (non-equal shift amounts) are scalarized, costing 4 x extract, 2 x x86-shifts and 2 x insert instructions - and it gets even more awkward on 32-bit targets.

This patch separately shifts the vector by both shift amounts and then shuffles the partial results back together, costing 2 x shuffles and 2 x sse-shifts instructions (+ 2 movs on pre-AVX hardware).

Note - this patch only improves the SHL / LSHR shifts as ASHR 2i64 shifts aren't currently supported in hardware - I'm looking at fixing this in a future patch.

Diff Detail

Repository
rL LLVM

Event Timeline

RKSimon updated this revision to Diff 22186.Mar 18 2015, 8:09 AM
RKSimon retitled this revision from to [X86][SSE] Avoid scalarization of v2i64 vector shifts.
RKSimon updated this object.
RKSimon edited the test plan for this revision. (Show Details)
RKSimon added reviewers: qcolombet, mkuper, andreadb, spatel.
RKSimon set the repository for this revision to rL LLVM.
RKSimon added a subscriber: Unknown Object (MLST).
andreadb edited edge metadata.Mar 18 2015, 9:14 AM

Hi Simon,

please see comments below.

lib/Target/X86/X86ISelLowering.cpp
16194 ↗(On Diff #22186)

This would generate worse code if Op.getOpcode() == ISD::SRA.
You should check that the opcode is not ISD::SRA. Otherwise, you would end up scalarizing two shifts.

test/CodeGen/X86/x86-shifts.ll
122–123 ↗(On Diff #22186)

Could you please add a check for the shift count?
Something like
CHECK-DAG: psrlq $8
CHECK-DAG: psrlq $1

It would be nice to also have checks for the two extra 'punpcklqdq shuffles that would be generated by your patch.

Thanks Andrea, I'll update a new version of the patch later today.

lib/Target/X86/X86ISelLowering.cpp
16194 ↗(On Diff #22186)

Yes - i64 SRA isn't currently supported (and doesn't go through LowerShift at all atm) but I will add the check. Initial tests indicate that SRA would be faster for constant shifts (and all AVX2 implementations) - but as I said I'll deal with SRA properly in a future patch.

test/CodeGen/X86/x86-shifts.ll
122–123 ↗(On Diff #22186)

Yes - I'll add more complete CHECK lines for all my changes.

RKSimon updated this revision to Diff 22202.Mar 18 2015, 12:06 PM
RKSimon edited edge metadata.

Updated patch based on Andrea's comments - checked against ashr and improved tests

andreadb accepted this revision.Mar 18 2015, 12:10 PM
andreadb edited edge metadata.

LGTM. Thanks Simon!

This revision is now accepted and ready to land.Mar 18 2015, 12:10 PM
This revision was automatically updated to reflect the committed changes.

Thanks Andrea - I'll prepare a i64 ASHR patch in the next few days.