This is an archive of the discontinued LLVM Phabricator instance.

[X86][AVX2] vpslldq/vpsrldq byte shifts for AVX2
ClosedPublic

Authored by RKSimon on Feb 12 2015, 10:25 AM.

Details

Summary

This patch refactors the existing lowerVectorShuffleAsByteShift function to add support for 256-bit vectors on AVX2 targets.

It also fixes a tablegen issue that prevented the lowering of vpslldq/vpsrldq vec256 instructions.

Diff Detail

Repository
rL LLVM

Event Timeline

RKSimon updated this revision to Diff 19843.Feb 12 2015, 10:25 AM
RKSimon retitled this revision from to [X86][AVX2] vpslldq/vpsrldq byte shifts for AVX2.
RKSimon updated this object.
RKSimon edited the test plan for this revision. (Show Details)
RKSimon added reviewers: chandlerc, qcolombet, andreadb.
RKSimon set the repository for this revision to rL LLVM.
RKSimon added a subscriber: Unknown Object (MLST).

I haven't looked at the patch in detail, but wondering if you plan to do anything about the AVX1 lowering of these shuffles. I would expect two 128-bit shifts for these cases in AVX1.

The AVX1 splits fail because the x86 computeZeroableShuffleElements() call isn't recursive - so can't peek through nodes (mainly shuffles but concat / subvector etc, as well do crop up). I'm hoping to give that code an overhaul in the near future - plus there is nothing x86 specific about it and would be useful to provide for others targets, the DAGCombiner etc. to check for zeros/zeroables. If you want I can do that work first and then update this patch?

No need to fix the AVX1 handling first. I was just making sure it was on your radar. I'd like to see this patch go in so we can remove more of the intrinsics.

lib/Target/X86/X86ISelLowering.cpp
7811 ↗(On Diff #19843)

llvm style is usually ++i on loop iterators. Same with a couple of the other loops.

7821 ↗(On Diff #19843)

Am I missing where the inner loop iterator is used or what it's repeating?

7837 ↗(On Diff #19843)

Can we avoid the multiply by 8 and just use the Shift value directly? Of course the patterns would also need to be fixed to just use the immediate directly instead of BYTE_imm.

Thanks Craig, I've fixed the minor issues - I'll put up a new patch shortly.

Chandler has made some changes to computeZeroableShuffleElements() that means that the AVX1 versions now see the zeroable lanes and use vpslldq/vpsrldq (the xmm versions) so that looks like its sorted too.

lib/Target/X86/X86ISelLowering.cpp
7837 ↗(On Diff #19843)

Yup we will be able to remove this multiply by 8 soon. Once this patch is in we can then remove the avx2 builtins for vpslldq/vpsrldq (similar to what happened in rL228481 for the SSE2 versions). At that point all the logic will be internal and we can replace the 'bit shift' immediate with a 'byte shift' version.

RKSimon updated this revision to Diff 19980.Feb 15 2015, 4:20 AM

Fixed minor typos, rebased and updated tests

chandlerc accepted this revision.Feb 15 2015, 4:28 AM
chandlerc edited edge metadata.

Feel free to go ahead and submit this Simon! Thanks for working on it, and glad you saw that I found and squashed the AVX1 nonsense so all this should be working now.

This revision is now accepted and ready to land.Feb 15 2015, 4:28 AM
This revision was automatically updated to reflect the committed changes.

Thanks guys, the final commit had me moving the v4i64 unpack shuffle matching after byte shift detection as it was interfering with single input shuffle matching - otherwise we were creating a zero register for no purpose.