This is an archive of the discontinued LLVM Phabricator instance.

[X86][SSE] Vectorized i8 and i16 shift operators
ClosedPublic

Authored by RKSimon on May 4 2015, 8:44 AM.

Details

Summary

This patch ensures that SHL/SRL/SRA shifts for i8 and i16 vectors avoid scalarization. It builds on the existing i8 SHL vectorized implementation of moving the shift bits up to the sign bit position and separating the 4, 2 & 1 bit shifts with several improvements:

1 - SSE41 targets can use (v)pblendvb directly with the sign bit instead of performing a comparison to feed into a VSELECT node.
2 - pre-SSE41 targets were masking + comparing with an 0x80 constant - we avoid this by using the fact that a set sign bit means a negative integer which can be compared against zero to then feed into VSELECT, avoiding the need for a constant mask (zero generation is much cheaper).
3 - SRA i8 needs to be unpacked to the upper byte of a i16 so that the i16 psraw instruction can be correctly used for sign extension - we have to do more work than for SHL/SRL but perf tests indicate that this is still beneficial.

The i16 implementation is similar but simpler than for i8 - we have to do 8, 4, 2 & 1 bit shifts but less shift masking is involved. SSE41 use of (v)pblendvb requires that the i16 shift amount is splatted to both bytes however.

I've updated the vectorization costs as best as I understand them. I'd like to know whether I should be providing SSE41 and AVX1 cost tables as well as these both deviate from the SSE2 version (use of (v)pblendvb and vex-encoded independent destination registers) - there appears to be very little testing of anything other than SSE2 so some advice would be helpful here.

Tested on SSE2, SSE41 and AVX machines.

Diff Detail

Repository
rL LLVM

Event Timeline

RKSimon updated this revision to Diff 24883.May 4 2015, 8:44 AM
RKSimon retitled this revision from to [X86][SSE] Vectorized i8 and i16 shift operators.
RKSimon updated this object.
RKSimon edited the test plan for this revision. (Show Details)
RKSimon added reviewers: qcolombet, delena, andreadb.
RKSimon set the repository for this revision to rL LLVM.
RKSimon added a subscriber: Unknown Object (MLST).
delena added inline comments.May 21 2015, 12:45 PM
test/CodeGen/X86/avx2-vector-shifts.ll
274 ↗(On Diff #24883)

You can extend to 2 vectors of <8 x i32> and use variable shift left VPSLLVD.

test/CodeGen/X86/vec_shift8.ll
201 ↗(On Diff #24883)

extend to 2 vectors of <4 x i32> and use PSRAD instruction

Thanks Elena, I'll get a fix for the AVX2 shifts done but can you explain the PSRAD approach please?

test/CodeGen/X86/avx2-vector-shifts.ll
274 ↗(On Diff #24883)

No problem - I'll adapt the existing 8i16 AVX2 shift code.

test/CodeGen/X86/vec_shift8.ll
201 ↗(On Diff #24883)

Can you explain this? I'm not sure what PSRAD will give us.

escha added a subscriber: escha.May 21 2015, 4:07 PM

Is it possible to do an i8 variable shift as follows? pseudocode (in0 = value being shifted, in1 = shift amounts):

vec8_multiplies: [1, 2, 4, 8, 16, 32, 64, 128]

pshufb tmp1, [vec8_multiplies], in1
pmovzxbw tmp2, in0
pmovzxbw tmp3, tmp1
pmullw tmp2, tmp3

or something like that? would it be faster, and could something of the sort maybe be doable for right shifts (multiply, then shift back down?) i16 might also be doable with appropriate hacks...

Hi Fiona,

I did look at implementing more of the shifts as multiplies, but we end up using a lot of instructions (+ registers) for the extension/truncation stages. It is easier with SSE41 instructions, but at that point we can use PBLENDVB which gives a much tighter solution. It might still be worthwhile for constant shift amounts but that would be a later patch.

RKSimon updated this revision to Diff 26325.May 22 2015, 8:03 AM

I've added AVX2 optimizations for v16i16 shifts - I investigated using extend/shift/truncation but the unpack/shift/pack approach I went with required fewer instructions.

delena edited edge metadata.May 25 2015, 4:30 AM

Hi,

I think you can commit all shifts v8i16 and v16i16 for AVX2.
I don't have time to check all other sequences.

  • Elena
RKSimon updated this revision to Diff 26467.May 25 2015, 11:44 AM
RKSimon edited edge metadata.

Thanks Elena, I've committed the AVX2 i16 shift lowering in rL238149. Updated this patch with just the pre-AVX2 lowering patterns.

RKSimon updated this revision to Diff 27033.Jun 3 2015, 5:28 AM

Ping - rebased + updated to use the new SelectionDAG::getBitcast helper function

RKSimon updated this revision to Diff 27193.Jun 5 2015, 6:03 AM

Updated patch now that we have the scalarized tests as a reference.

RKSimon updated this revision to Diff 27268.Jun 6 2015, 6:25 AM

Fixed tests to check that scalar shifts are no longer happening.

andreadb accepted this revision.Jun 9 2015, 8:47 AM
andreadb edited edge metadata.

Hi Simon,

Thanks for the patch and sorry it took a long time for me to review this patch.
The change looks good to me.

Thanks!
Andrea

lib/Target/X86/X86ISelLowering.cpp
16980 ↗(On Diff #27268)

Okay, this is because VPBLENDVB on v32i8 is only available on AVX2.

This revision is now accepted and ready to land.Jun 9 2015, 8:47 AM
This revision was automatically updated to reflect the committed changes.

Thanks Andrea