This is an archive of the discontinued LLVM Phabricator instance.

[X86] Don't assume that a shuffle operand is #0: it isn't for VPERMV.
ClosedPublic

Authored by ab on Feb 9 2016, 2:13 PM.

Details

Summary

Since r246981 ("AVX-512: Lowering for 512-bit vector shuffles"), VPERMV is recognized in getTargetShuffleMask.

This breaks assumptions in most callers, as they expect N->getOperand(0) to be (one of) the vector operand(s). It isn't, as VPERMV has the mask as operand #0 (I can't think of another shuffle-like instruction that works the same).
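To make the operand-order problem concrete, a variable permute like VPERMD can be modeled roughly as follows (an illustrative Python sketch of the semantics, not LLVM code; the point is that the node carries an index vector and a data vector, and for X86ISD::VPERMV the index vector comes first):

```python
def vpermv(index_vec, src_vec):
    # Model of a VPERMD/VPERMV-style variable permute: each result
    # element is the source element selected by the corresponding
    # index (modulo the vector width). Note the index vector is the
    # FIRST argument here, mirroring X86ISD::VPERMV's operand order.
    n = len(src_vec)
    return [src_vec[i % n] for i in index_vec]
```

A caller that blindly treats argument 0 as the data vector would permute the indices instead, which is exactly the class of bug this patch is about.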

In the added testcase, this leads to the funny-looking:

vmovdqa .LCPI0_0(%rip), %ymm0   # ymm0 = [0,1,2,3,4,5,6,4]
vpshufb .LCPI0_1(%rip), %ymm0, %ymm0 # ymm0 = ymm0[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,16,17,18,18]

In my original testcase (s/i32 4>)/i32 1>)/ should do the trick), the VPSHUFB lane restriction was another problem, but Simon fixed that in r260063.

I can think of two obvious solutions:

  • swap the X86ISD::VPERMV operands, commenting in X86ISelLowering.h that it's different from the instructions. IMO, it's confusing either way.
  • return the operands and fix the users. There are many users, some of which (e.g., setTargetShuffleZeroElements) only return a mask themselves. This doesn't seem perfect either.

This (very rough, WIP) patch implements the latter.

What do you think? We might improve this by having a struct wrap <Mask, IsUnary, Ops>, and hopefully avoid computing slightly different things in different places.
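That <Mask, IsUnary, Ops> wrapper could look roughly like this (a hypothetical Python model of what would actually be a C++ struct in X86ISelLowering.cpp; all names here are illustrative, not from the patch):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DecodedShuffle:
    # Hypothetical wrapper for what getTargetShuffleMask could return:
    # the decoded mask, whether the shuffle reads a single input, and
    # the actual vector operand(s) -- which for VPERMV is operand 1.
    mask: List[int]
    is_unary: bool
    ops: List[str]

def decode(opcode, operands):
    # Most target shuffles put the vector operand(s) first; VPERMV puts
    # the index vector first, so callers must not assume operands[0] is
    # a source vector. Returning the operands alongside the mask hides
    # that difference from every caller.
    if opcode == "VPERMV":
        return DecodedShuffle(mask=operands[0], is_unary=True, ops=[operands[1]])
    return DecodedShuffle(mask=operands[-1], is_unary=len(operands) == 2,
                          ops=operands[:-1])
```

The benefit is that callers consume `ops` uniformly instead of each re-deriving which operand holds the data vector.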

Diff Detail

Event Timeline

ab updated this revision to Diff 47363.Feb 9 2016, 2:13 PM
ab retitled this revision from to [X86] Don't assume that a shuffle operand is #0: it isn't for VPERMV..
ab updated this object.
ab added reviewers: RKSimon, spatel.
ab added a subscriber: llvm-commits.
spatel edited edge metadata.Feb 10 2016, 9:59 AM

Hi Ahmed -

I'm still blissfully ignorant of AVX-512, so my opinion shouldn't have as much weight as people who are working on that (cc some of the others that were on D10683?).

But I would lean towards the first solution: swap the operands for the X86ISD::VPERMV node. If I'm understanding the problem, this would (mostly?) limit the changes to the td defs. We barely document the DAG node operands or their orders anyway, so adding that kind of info seems fair to me. I think it's better to preserve the software uniformity as long as possible, even if the hardware instructions are a mess. I.e., the C intrinsics keep the expected order:
https://software.intel.com/en-us/node/524011
...so let's preserve that illusion as long as we can.

Hi Ahmed -

I'm still blissfully ignorant of AVX-512, so my opinion shouldn't have as much weight as people who are working on that (cc some of the others that were on D10683?).

But I would lean towards the first solution: swap the operands for the X86ISD::VPERMV node. If I'm understanding the problem, this would (mostly?) limit the changes to the td defs.

Yep

We barely document the DAG node operands or their orders anyway, so adding that kind of info seems fair to me. I think it's better to preserve the software uniformity as long as possible, even if the hardware instructions are a mess. I.e., the C intrinsics keep the expected order:
https://software.intel.com/en-us/node/524011
...so let's preserve that illusion as long as we can.

I agree; I'm mostly worried about us developers expecting the SD op to match the instruction.
I didn't realize the C intrinsics were swapped. IMO that's enough to justify #1.

..but now that I look it up, the AVX-512 intrinsics use the instruction order *sigh*
Back to square one.

-Ahmed

ab updated this revision to Diff 47486.Feb 10 2016, 10:40 AM
ab edited edge metadata.

Add back context to original patch.

In D17041#348716, @ab wrote:

..but now that I look it up, the AVX-512 intrinsics use the instruction order *sigh*
Back to square one.

Wow. Is it too late for Intel to fix/deprecate/rename those intrinsics? If the argument is that the intrinsics should match the asm, then what happened with the AVX2 vperm variants?

It is really too late. There are several other toolchains that already implemented the intrinsics as defined, and they have been in use for quite a while.


RKSimon added inline comments.Feb 10 2016, 2:13 PM
lib/Target/X86/X86ISelLowering.cpp
23962

Was there a need to add this? We already have all-undef and all-undef/zero handling below and in combineX86ShuffleChain.

24630–24631

Don't these need dealing with as well? Also, at the moment the getOperand(1) calls mean that unary target shuffles can't use this combine as they would assert....

ab added inline comments.Feb 11 2016, 12:56 PM
lib/Target/X86/X86ISelLowering.cpp
23962

Yes, but before we can figure that out, we only checked undef (see the TODO about zero/undef below), and we crash trying to call getOpcode() on SDValue().

But to be clear: the whole patch is very mechanical, and this is a hack. I'm mainly interested in whether I should proceed with this or swap the operands.

24630–24631

This actually doesn't get called for all shuffle opcodes, though it probably should.

24647

And this looks like it should fall through to the other cases if PerformShuffleCombine256 didn't succeed.

RKSimon edited edge metadata.Feb 20 2016, 8:28 AM

Hi Ahmed, I ended up using some of the same code to fix PR26667 - please can you rebase this patch?

ab updated this revision to Diff 48718.Feb 22 2016, 12:25 PM
ab edited edge metadata.

Rebase away fixes covered by r261433-4.

Ping of sorts: Elena, all, how do you think we should fix this: swap the VPERMV operands, or change all of our code to stop assuming the mask is the last operand (this patch)? I don't do much AVX512, so I'm not comfortable making that decision.

Ping of sorts: Elena, all, how do you think we should fix this: swap the VPERMV operands, or change all of our code to stop assuming the mask is the last operand (this patch)? I don't do much AVX512, so I'm not comfortable making that decision.

In my opinion, the mask op should not always be the last. I don't think that you need to swap the VPERMV operands. As for AVX-512, the form is the same:
vpermd %vec_op, %mask_op, %vec_res.

I'll tell you more: in AVX-512 we have two forms of 3-src shuffles:
VPERMT2D zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst Permute double-words from two tables in zmm3/m512/m32bcst and zmm1 using indices in zmm2 and store the result in zmm1 using writemask k1

VPERMI2D zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst Permute double-words from two tables in zmm3/m512/m32bcst and zmm2 using indices in zmm1 and store the result in zmm1 using writemask k1.

So, it's ok that the mask is not last in SDNode.
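The two-table permutes described above can be modeled roughly like this (an illustrative Python sketch of the VPERMT2D-style semantics, not LLVM or hardware code): the indices address the concatenation of the two source tables, with bit log2(n) of each index effectively selecting which table an element comes from.

```python
def vpermt2(table1, indices, table2):
    # Model of an AVX-512 two-table permute: each index selects an
    # element from the concatenation of the two tables. Lower bits
    # pick the element; the next bit picks the table.
    n = len(table1)
    combined = table1 + table2
    return [combined[i % (2 * n)] for i in indices]
```

Note the argument order above mirrors the asm form (table, indices, table), with the index vector in the middle rather than last.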

ab updated this revision to Diff 49433.Feb 29 2016, 4:37 PM

Simplify getTargetShuffleMask and rebase.

ab added a comment.Feb 29 2016, 4:52 PM

Ping of sorts: Elena, all, how do you think we should fix this: swap the VPERMV operands, or change all of our code to stop assuming the mask is the last operand (this patch)? I don't do much AVX512, so I'm not comfortable making that decision.

In my opinion, the mask op should not always be the last. I don't think that you need to swap the VPERMV operands. As for AVX-512, the form is the same:
vpermd %vec_op, %mask_op, %vec_res.

I'll tell you more: in AVX-512 we have two forms of 3-src shuffles:
VPERMT2D zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst Permute double-words from two tables in zmm3/m512/m32bcst and zmm1 using indices in zmm2 and store the result in zmm1 using writemask k1

VPERMI2D zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst Permute double-words from two tables in zmm3/m512/m32bcst and zmm2 using indices in zmm1 and store the result in zmm1 using writemask k1.

So, it's ok that the mask is not last in SDNode.

All else being equal, I'd actually really prefer for the mask to always be last, because it helps preserve sanity in the backend :(

So, you say it's OK for VPERMV. But I think I'm really asking the opposite: is it OK if we make the mask always be the last operand?

Otherwise, I rebased and cleaned up the patch, so reviews are appreciated!

Some comments - might be void if altering the intrinsic argument ordering is going to happen.

lib/Target/X86/X86ISelLowering.cpp
4931

If you're going to add this maybe add an assertion for Mask.empty() as well?

test/CodeGen/X86/avx2-vperm-combining.ll
3 ↗(On Diff #49433)

Possibly rename this to vector-shuffle-combining-avx2.ll? It matches the other test files that have already been added for SSSE3/AVX-only shuffle intrinsic combine tests.

16 ↗(On Diff #49433)

Add permps test as well.

lib/Target/X86/X86ISelLowering.cpp
5038

Make this comment explicit? Something like:
"Unlike most shuffle nodes, VPERMV's mask operand is operand 0."
I think we should also note this in the header file definition for any nodes that are "special".

Given that you've already done the work, I think we might as well move forward on this path.
There's really no hope of maintaining sanity. I just saw page 5-600 (p. 706) of:
https://software.intel.com/sites/default/files/managed/b4/3a/319433-024.pdf

Embrace the madness. :)

ab updated this revision to Diff 49558.Mar 1 2016, 3:43 PM
  • Cleanup and expand tests
  • Return the correct operands for VPERMV3 (can't get it to fire)
  • Document VPERMV weirdness in X86ISD
  • Assert that Mask is clear in getTargetShuffleMask; document that.
ab marked 6 inline comments as done.Mar 1 2016, 3:46 PM
ab added inline comments.
lib/Target/X86/X86ISelLowering.cpp
5038

Yeah, Elena also mentioned VPERMT2 and (the worst of all) VPERMI2. I have lost all hope!

RKSimon accepted this revision.Mar 2 2016, 3:22 AM
RKSimon edited edge metadata.

VPERMIV3 isn't being used yet but you should be able to create some VPERMV3 tests for a vector-shuffle-combining-avx512.ll test file - check avx512-intrinsics.ll for direct calls to llvm.x86.avx512.mask.vpermt2var.*

Other than that - LGTM.

This revision is now accepted and ready to land.Mar 2 2016, 3:22 AM
ab marked an inline comment as done.Mar 3 2016, 8:58 AM

So, I don't think the VPERMV3 case is currently reachable (and boy did I try):

  • VPERMV3 is only formed during lowering
  • one getTargetShuffleMask caller is XFormVExtractWithShuffleIntoLoad, which operates on extractelt nodes, which are lowered into extractelt (extract_subvector) earlier.
  • extract_subvector prevents the BLENDI and INSERTPS combines as well
  • 512-bit shuffles are only lowered into VPERMV, VPERMV3, and UNPCK, none of which is unary, so combineX86ShufflesRecursively doesn't fire either
  • getShuffleScalarElt and combineShuffles can't be called for VPERMV3, as they're only called for a select few shuffle opcodes.

I might have missed something though, so ideas welcome!

This does expose a couple of problems:

  • should we delete the VPERMV3 code, and add it back once it can actually fire?
  • should we try to run combineShuffle for all of the opcodes?
This revision was automatically updated to reflect the committed changes.
ab added a comment.Mar 3 2016, 8:59 AM

Also, thanks for the reviews everyone! r262627

In D17041#367330, @ab wrote:

So, I don't think the VPERMV3 case is currently reachable (and boy did I try):

Thanks for trying Ahmed! You may have more luck after applying D17858 - not guaranteeing anything though.

I'd recommend keeping the code in for now - it's going to be needed sooner rather than later.

RKSimon added inline comments.Mar 3 2016, 2:22 PM
llvm/trunk/lib/Target/X86/X86ISelLowering.cpp
5076 ↗(On Diff #49743)

Just noticed that we're not attempting to detect unary shuffles (which probably explains why you've had so much trouble getting combines to fire):

IsUnary = IsFakeUnary = N->getOperand(0) == N->getOperand(2);
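As a minimal sketch of what that suggested check computes, assuming a (src0, index, src1) operand order for VPERMV3-style nodes as in the snippet above (illustrative Python, not the actual LLVM code):

```python
def is_unary_vpermv3(operands):
    # For a VPERMV3-style node (src0, index, src1), the shuffle is
    # effectively unary when both source operands are the same vector:
    # every index then selects from a single table.
    src0, _index, src1 = operands
    return src0 == src1
```

Without this detection, a two-table permute of one repeated table never matches the unary-only combines, which is consistent with the combines not firing during testing.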