This is an archive of the discontinued LLVM Phabricator instance.

AVX-512 vector shuffle lowering
ClosedPublic

Authored by delena on Jun 24 2015, 1:58 AM.

Details

Summary

Implemented lowering for 512-bit vector shuffles.
Vector types: <8 x 64>, <16 x 32>, <32 x 16> float and integer.

AVX-512 provides vector shuffle instructions with variable mask (mask in register) for one and two sources (VPERM and VPERMT2).
Use them instead of splitting vectors.

All new shuffle instructions are for integer and FP data types.

More optimizations in the next patch.

Diff Detail

Repository
rL LLVM

Event Timeline

delena updated this revision to Diff 28322.Jun 24 2015, 1:58 AM
delena retitled this revision from to AVX-512 vector shuffle lowering.
delena updated this object.
delena edited the test plan for this revision. (Show Details)
delena added reviewers: chandlerc, qcolombet, andreadb.
delena set the repository for this revision to rL LLVM.
delena added a subscriber: Unknown Object (MLST).

Ping*

  • Elena
RKSimon edited edge metadata.Jul 19 2015, 3:38 AM

The lowerVectorShuffleWithPERMV looks alright, but why did you merge the lowerV8F64VectorShuffle/lowerV8I64VectorShuffle and lowerV16F32VectorShuffle/lowerV16I32VectorShuffle functions? Yes the permutevar instructions are available on all domains but when you come to adding support for the faster immediate shuffle instructions (vpshufd/vpshufps/vpermilps/vpermilpd etc.) the implementations will diverge quite quickly again.

Also, it'd be nice if you could add shuffle decode support for the AVX512 vperm* instructions so there are helpful assembly comments.

delena added a subscriber: delena.Jul 19 2015, 8:53 AM

AVX-512 shuffles are more symmetric for FP and Int types, than divergent.
So if I split them, I'll write the same code twice. I know this because I have the full shuffle implementation on my table.
You can look at it here, but on the left side, in red.
http://reviews.llvm.org/D10502

The only "if" you'll see at line 10274, left side.

All vperm/vpermil/valign/unpack are symmetric.

Also, it'd be nice if you could add shuffle decode support for the AVX512 vperm* instructions so there are helpful assembly comments.

No problem, I will. Do you want to see this in the same patch?

  • Elena
chandlerc edited edge metadata.Jul 19 2015, 6:44 PM

AVX-512 shuffles are more symmetric for FP and Int types, than divergent.
So if I split them, I'll write the same code twice. I know this because I have the full shuffle implementation on my table.
You can look at it here, but on the left side, in red.
http://reviews.llvm.org/D10502

The only "if" you'll see at line 10274, left side.

All vperm/vpermil/valign/unpack are symmetric.

But there are other forms that aren't symmetric (and Simon mentioned some in his email).

You can still have both fp and int sides delegate to common routines for perm(il), align, and unpack (much as the AVX2 code does).

I assume the duplicated code you're frustrated by is the UNPCK lowering code? The interesting this is how much duplication would actually be removed by factoring that out. It might be worth doing, especially as the masks become huge for AVX-512. Still, that should be factored into an unpack lowering helper.

Also, it'd be nice if you could add shuffle decode support for the AVX512 vperm* instructions so there are helpful assembly comments.

No problem, I will. Do you want to see this in the same patch?

I think this should land first. It makes the tests substantially easier to review. The sequence in which these things should land:

  1. basic instruction support in the .td files
  2. shuffle mask encode/decode which is tested using target intrinsics or tested purely in MC
  3. shuffle lowering as in this patch

That way, the support each one needs to be tested lands with it or in a prior patch, and each patch is as reasonably small of a chunk for the reviewer.

Ayal added inline comments.Jul 20 2015, 7:57 AM
lib/Target/X86/X86ISelLowering.cpp
10084 ↗(On Diff #28322)

Missing space "DL,MaskEltVT"

10113 ↗(On Diff #28322)

You may want to check isSingleInputShuffleMask() here, and call either lowerVectorShuffleWithPERMV() or lowerVectorShuffleWithPERMV3(?) accordingly.

10198 ↗(On Diff #28322)

typo in original code: supprot

delena updated this revision to Diff 30490.Jul 23 2015, 8:42 AM
delena edited edge metadata.
delena removed rL LLVM as the repository for this revision.

Added shuffle decoding for VPERMV and VPERMV3 nodes.
I can't add llc comments because the instructions load mask from memory.

Added a helper for UNPCK.

Added shuffle decoding for VPERMV and VPERMV3 nodes.
I can't add llc comments because the instructions load mask from memory.

You should be able to implement this like PSHUFB is handled in X86AsmPrinter::EmitInstruction.

lib/Target/X86/X86ISelLowering.cpp
10529 ↗(On Diff #30490)

Could you put lowerVectorShuffleWithUNPCK earlier in the source file so we can use it for the 128/256 bit shuffle lowering functions? You don't have to fix the other uses in this patch if you want to keep it focussed on AVX512.

You should be able to implement this like PSHUFB is handled in X86AsmPrinter::EmitInstruction.

PSHUFBrm form loads mask from memory.
The VPERMV and VPERMV3 sets both take mask from register. Unlike PSHUFB, they both take source operand from memory.

You should be able to implement this like PSHUFB is handled in X86AsmPrinter::EmitInstruction.

PSHUFBrm form loads mask from memory.
The VPERMV and VPERMV3 sets both take mask from register. Unlike PSHUFB, they both take source operand from memory.

For PSHUFBrr forms, we work to comment the load into the register. This may already work for VPERMV and VPERMV3. You can see an example of this in vector-shuffle-128-v16.ll, @PR12412. That test shows the load of the vector constant that is used by two consecutive PSHUFB instructions.

Something that I've actually wanted to do to make these kinds of tests substantially easier to read would be to teach the instruction printer to look through trivial patterns like this. IE, walk from the register operand to its def, see if its def is a load from the constant pool, and if so, decode it into the nice shuffle syntax. Similarly, if its def is an xor of a register, decode as-if it were a zero mask. I've never had the time to go and implement it, but if this is something you're interested in, I'd love to see the patch.

Anyways, as for getting AVX-512 stuff going, if all the shuffle mask decoding logic that can be done is in place then move on to the next step. But the lack of a comment with the vector constant expanded in it in the vmovdqa64 instructions above make me think that you'll at least need to tweak the code that comments a load from a constant pool.

delena updated this revision to Diff 30682.Jul 27 2015, 2:41 AM
delena set the repository for this revision to rL LLVM.

Anyways, as for getting AVX-512 stuff going, if all the shuffle mask decoding logic that can be done is in place then move on to the next step. But the lack of a comment with the vector constant expanded in it in the vmovdqa64 instructions above make me think that you'll at least need to tweak the code that comments a load from a constant pool.

I added all AVX-512 VMOV..rm instructions to the constant pool printing.

(I'm going to an extended vacation. I'd be happy to receive LGTM, if this patch is ready. Thanks.)

I think its almost there - just a few questions that I missed earlier.

lib/Target/X86/X86ISelLowering.cpp
4627 ↗(On Diff #30682)

All this code looks very similar to the X86ISD::PSHUFB implementation - is there anyway that they can be merged?

4690 ↗(On Diff #30682)

All this code looks very similar to the X86ISD::PSHUFB implementation - is there anyway that they can be merged?

10547 ↗(On Diff #30682)

Can we take into account duplicated inputs here? What about matching the inputs if they were swapped?

delena updated this revision to Diff 33912.Sep 3 2015, 2:01 AM

I added UNPACK shuffles with swapping operands

delena added a comment.Sep 7 2015, 1:56 AM

Do you have more comments on this?

Thanks.

lib/Target/X86/X86ISelLowering.cpp
4627 ↗(On Diff #30682)

PSHUFB case has some differences, NumBytesPerElement, for example.
And a call DecodePSHUFBMask(). And broadcast analysis in VPEMV
I don't want to merge PSHUFB and VPEMV.

VPERMV and VPERMV3 have more in common. But they also different.
I don't want to merge them.

10547 ↗(On Diff #30682)

I added the "swap" case (in the new patch). As far as duplicated input, it is not AVX-512 specific. It should be a part of common shuffle simplification.

RKSimon accepted this revision.Sep 7 2015, 8:21 AM
RKSimon edited edge metadata.

LGTM. I agree that lowerVectorShuffleWithUNPCK can be revised later - and it'd be great to be able to use it for other vector types.

This revision is now accepted and ready to land.Sep 7 2015, 8:21 AM
This revision was automatically updated to reflect the committed changes.