This is an archive of the discontinued LLVM Phabricator instance.

[X86][SSE] Vectorized v4i32 non-uniform shifts.
ClosedPublic

Authored by RKSimon on Jul 9 2015, 8:09 AM.

Details

Summary

While the v4i32 shl operation is already vectorized using a cvttps2dq/pmulld pattern, the lshr/ashr operations are still scalarized.

This patch adds vectorization support for non-uniform v4i32 shift operations: constant shift amounts are splatted so that the immediate SSE shift instructions can be used, while non-constant shift amounts are extracted and zero-extended. The per-lane results are then blended together.
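The lowering described above can be sketched in scalar C. This is an illustrative model of the splat/shift/blend pattern, not the actual SelectionDAG code; the function name `lshr_v4i32` is invented for this sketch:

```c
#include <stdint.h>

/* Model of the non-uniform v4i32 lshr lowering: for each lane, splat
 * that lane's shift amount, shift the whole vector uniformly by it
 * (as the SSE scalar-amount shifts do), then blend lane i of the i-th
 * shifted result into the final vector. */
static void lshr_v4i32(const uint32_t v[4], const uint32_t amt[4],
                       uint32_t out[4]) {
    for (int lane = 0; lane < 4; ++lane) {
        uint32_t splat = amt[lane];          /* splatted shift amount  */
        uint32_t shifted[4];
        for (int i = 0; i < 4; ++i)
            shifted[i] = v[i] >> splat;      /* uniform shift of all lanes */
        out[lane] = shifted[lane];           /* blend: keep only this lane */
    }
}
```

Four uniform shifts plus blends replace four scalar extract/shift/insert sequences.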

Diff Detail

Repository
rL LLVM

Event Timeline

RKSimon updated this revision to Diff 29334.Jul 9 2015, 8:09 AM
RKSimon retitled this revision from to [X86][SSE] Vectorized v4i32 non-uniform shifts..
RKSimon updated this object.
RKSimon added reviewers: qcolombet, delena, spatel, andreadb.
RKSimon set the repository for this revision to rL LLVM.
RKSimon added a subscriber: llvm-commits.
qcolombet accepted this revision.Jul 9 2015, 4:15 PM
qcolombet edited edge metadata.

Hi Simon,

LGTM, with a few nitpicks to make it easier to come back to the code later.

Please commit your updated version directly.

Cheers,
-Quentin

lib/Target/X86/X86ISelLowering.cpp
17383

Period.

17384

The wording is strange.

17392

We could use UNDEF for the second operand; that should save the generic code from having to canonicalize it.

17395

Wouldn’t it make sense to leave more freedom to the next optimizer by using more undef indices:
0, -1, -1, -1
-1, 1, -1, -1
etc.

It looks to me like we have too precise an idea of what the lowering should look like, and we over-specify the data as a result.

17398

Maybe add a note saying that SHL v4i32 is handled earlier in this function.

17399

Is this case reachable?

I thought we were handling SHL v4i32 earlier in this function (line 17300).
Though I guess it does not hurt to have it here.

17407

I guess you use 0, 4, then 1, 5, etc. instead of 0, 4, then 1, 4, etc. because those masks are legal for shuffles.
If that is the case, then add a comment; if not, maybe just use 4 for all the zero-vector lanes.

Also a comment saying that X86 shifts:

  • Use only the first 64 bits of the count register for the shift amount.
  • Shift all the lanes by that single amount (i.e., the one held in those first 64 bits, as said above), unlike LLVM shifts, where each lane is shifted by the amount at the corresponding index.

would help reading the code.

Part of the information is at the beginning of the block, but I think it is cryptic unless you know the actual instructions.

Right now, unless you have the Intel documentation in front of you, this is not that easy to read.
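The hardware semantics Quentin describes can be modeled in a few lines of C. This is a sketch of the behaviour as described in the comment above (the `psrld_model` name is invented here), contrasting it with LLVM's per-lane IR semantics:

```c
#include <stdint.h>

/* Model of PSRLD xmm, xmm count semantics as described above: the shift
 * amount is taken from the low 64 bits of the count operand and applied
 * uniformly to EVERY lane; counts >= 32 zero the lanes. This differs
 * from LLVM's lshr on <4 x i32>, where each lane is shifted by its own
 * per-lane amount. */
static void psrld_model(const uint32_t v[4], uint64_t count, uint32_t out[4]) {
    for (int i = 0; i < 4; ++i)
        out[i] = (count >= 32) ? 0u : (v[i] >> count);
}
```

This mismatch is why the lowering must splat (or extract and zero-extend) each lane's amount before shifting.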

This revision is now accepted and ready to land.Jul 9 2015, 4:15 PM

Thanks Quentin, I'll commit the patch with updates later today. Comments below.

lib/Target/X86/X86ISelLowering.cpp
17395

If we do this then the shuffle gets removed (so we lose the information that the other lanes are undef), meaning that we don't recognise it as a splat.

17399

The reason the SHL case is there is that I've found this approach appears to be faster for non-constant amounts than the cvttps2dq/pmuludq approach on older (pre-SSE41) targets. I'm still testing this though (I don't have that wide a range of older hardware these days), so I haven't made it the default yet. I'll add a comment.
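For reference, the cvttps2dq-based shl trick mentioned above can be modeled in scalar C. This is a hedged sketch of the underlying idea, not the patch's code; `shl_via_pow2` is an invented name, and the trick relies on IEEE-754 single precision (exponent bias 127):

```c
#include <stdint.h>
#include <string.h>

/* Sketch of the cvttps2dq/pmul* shl idea: construct the float 2^amt by
 * placing (amt + 127) in the IEEE-754 exponent field, truncate-convert
 * back to integer, then multiply, since x << amt == x * 2^amt.
 * Valid for shift amounts where the result fits (0 <= amt <= 30). */
static uint32_t shl_via_pow2(uint32_t x, uint32_t amt) {
    uint32_t bits = (amt + 127u) << 23;  /* binary32 encoding of 2^amt */
    float pow2;
    memcpy(&pow2, &bits, sizeof pow2);   /* vector form: paddd + bitcast */
    return x * (uint32_t)pow2;           /* cvttps2dq then pmulld/pmuludq */
}
```

In the vector lowering, the exponent construction and conversion happen lane-wise, so all four non-uniform amounts are handled at once.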

This revision was automatically updated to reflect the committed changes.