Page MenuHomePhabricator

[X86][AVX] Only shuffle the lower half of vectors if the upper half is undefined

Authored by RKSimon on Dec 12 2015, 6:17 AM.



First step towards making better use of AVX's implicit zeroing of the upper half of a 256-bit vector by instructions that only act on the lower 128-bit vector - discussed on D14151.

As well as the fact that 128-bit shuffle instructions are generally more capable, this can be performant for older CPUs with 128-bit ALUs (e.g. Jaguar, Sandy Bridge) that must treat 256-bit vectors as multiple micro-ops.

Note: I've avoided combining shuffles that reference elements from the upper halves of the input vectors - this may be reviewed in future work as well (AVX1 would probably always gain, but AVX2 does have some cross-lane shuffle instructions).

Diff Detail


Event Timeline

RKSimon updated this revision to Diff 42638.Dec 12 2015, 6:17 AM
RKSimon retitled this revision from to [X86][AVX] Only shuffle the lower half of vectors if the upper half is undefined.
RKSimon updated this object.
RKSimon added reviewers: andreadb, spatel, congh, delena.
RKSimon set the repository for this revision to rL LLVM.
RKSimon added a subscriber: llvm-commits.
delena added inline comments.Dec 12 2015, 2:17 PM
22587 ↗(On Diff #42638)

Hi Simon,

Why you are doing this in PerformShuffleCombine and not in the lowerVectorShuffle()? As far as I understand, we call "combine" in order to combine multiple nodes. In this case, you just optimize one node.

congh added inline comments.Dec 12 2015, 2:57 PM
22530 ↗(On Diff #42638)

Nit: left space on +?

22598 ↗(On Diff #42638)

Early exit if AllLowerHalf is false?

22611 ↗(On Diff #42638)

Can we always guarantee that V is a 128-bit vector? I remember VECTOR_SHUFFLE can have different types for its operands and result.

RKSimon added inline comments.Dec 12 2015, 4:34 PM
22587 ↗(On Diff #42638)

Mainly because this has more in common with the 2 extract/insert patterns above than the canonicalization in lowerVectorShuffle. But I'm happy to move it (and the other 2?) there if you think necessary.

22611 ↗(On Diff #42638)

At this stage yes we know that the result and operands are all 256-bit vectors - shuffles in the DAG have to have consistent types. I did add the FIXME comment mentioning that this could be generalised to 256 or 512 bit vectors though - at that point it would need to be refactored.

delena added inline comments.Dec 12 2015, 11:06 PM
22587 ↗(On Diff #42638)

Yes. Thank you. It is the compilation time first of all.

3287 ↗(On Diff #42638)

This test and all tests bellow check the situation when one of the input vectors is not in use. But you, actually, optimize the case when the shuffle uses a half of V1 and a half of V2, right?

RKSimon updated this revision to Diff 42669.Dec 13 2015, 9:44 AM

As discussed, I've moved the patch into the canonicalization step of lowerVectorShuffle, along with the 2 existing extract/insert patterns from PerformShuffleCombine256.

Added more tests to demonstrate 'upper undef' 256-bit shuffles with 2 inputs - moving the code has also improved some other existing tests.

I haven't enabled 512-bit vector support/tests - I'd prefer to add this in a later commit - but I have generalised the existing 128-bit extract/insert code to be ready for it.

delena added inline comments.Dec 14 2015, 11:46 AM
11289 ↗(On Diff #42669)

Could, you, please take it into a static function?
May be call it from lower256BitVectorShuffle ?

11314 ↗(On Diff #42669)

Why do you check only UndefUpper? What about UndefLower?

11329 ↗(On Diff #42669)

I don't understand this code. You are running inside loop. for (unsigned i = 0; i != HalfNumElts; ++i)
for v32i8 you have 16 iterations. Do you create EXTRACT_SUBVECTOR 16 times?

RKSimon added inline comments.Dec 14 2015, 12:06 PM
11289 ↗(On Diff #42669)

No problem.

11314 ↗(On Diff #42669)

OK - I can add this - it will still initially just support shuffling with the lower half vectors is that OK?

11329 ↗(On Diff #42669)

The DAG.getNode logic will find a equivalent node if it already exists in the DAG (search for FindNodeOrInsertPos) so although we call getNode(ISD::EXTRACT_SUBVECTOR, ...) upto HalfNumElts times, it will return at most 2 values on success - on fail case it will return a 3rd value which will then fail to match either of the Half variables.

But I can see it won't be clear, and slower then necessary, so I'll replace it with an integer index approach.

RKSimon updated this revision to Diff 42838.Dec 15 2015, 5:33 AM

Updated based on Elena's feedback.

I've enabled 512-bit support without any trouble.

We are missing some vpermq/vpermpd patterns - maybe worth disabling 4i64/4f64 shuffles with UndefLower on AVX2? The latencies of the 2 approaches appear to be very similar.

delena added inline comments.Dec 17 2015, 12:26 AM
10363 ↗(On Diff #42838)

You can exit from function in this case. I mean "return SDValue()", right?

10378 ↗(On Diff #42838)

Do you leave the loop at this point?

11189 ↗(On Diff #42838)

I'm not sure that that is a beneficial variant for AVX-512.
Some reasons:

    • AVX-512 has many additional shuffles, we plan to add more patterns in the future
  • In KNL target (no VLX, BWI, DQ) you are moving down to AVX2, we have less registers, less shuffles.
  • SKX (withVLX, BWI, DQ) supports 256-bit vectors, but we currently don't have any additional optimization for 256-bit vectors on SKX.
RKSimon added inline comments.Dec 17 2015, 1:23 AM
10363 ↗(On Diff #42838)

Yes - I'll change it to return SDValue().

10378 ↗(On Diff #42838)

Yes - I'll change it to return SDValue().

11189 ↗(On Diff #42838)

OK - I'll drop AVX512 support.

RKSimon updated this revision to Diff 43113.Dec 17 2015, 2:57 AM

Updated based on Elena's feedback

delena added inline comments.Dec 17 2015, 3:42 AM
10367 ↗(On Diff #43113)

Let's assume that the original mask was:
0, 1, 3, 3, 8, 8, 10, 11
You want to take V1-Lo and V2-Lo
the new mask should be
0, 1, 3, 3, 4, 4, 6, 7
But M %= NumElts will not convert 8 to 4 and 10 to 6.

10394 ↗(On Diff #43113)

extra line

RKSimon added inline comments.Dec 17 2015, 3:49 AM
10367 ↗(On Diff #43113)

I think this is for a v816 - then the modulo will create:

0, 1, 3, 3, 0, 0, 2, 3

The lines below then offset the second half vector by halfnumelts:

0, 1, 3, 3, 4, 4, 6, 7

delena added inline comments.Dec 17 2015, 4:35 AM
10367 ↗(On Diff #43113)

Lets assume that original VT was v8i32. NumElts = 8.
M %= NumElts -?
may be
M %= HalfNumElts ?

RKSimon updated this revision to Diff 43251.Dec 18 2015, 10:34 AM

Thank Elena - you're right my indexing would only have worked for lower half vectors, fine for the current setup but not if we generalise it in the future.

delena edited edge metadata.Dec 21 2015, 12:18 AM

Hi Simon,

You have 2 input vectors A and B.
Let's see them as (a0, a1) and (b0, b1). You are looking for the one of the following patterns, based on mask:

  1. (a0, u), (b0, u)
  2. (u, a1), (b0, u)
  3. (a0, u), (u, b1)
  4. (u, a1), (b0, u)

One pass through the mask vector will map the original shuffle to one of the 4 patterns, if possible.
Then you extract lower or upper part from A and from B according to the pattern.
And calculate the mask as following:
if (M[i] < NumElts) // vector A

Newmask[i] =  M[i] % HalfNumElts

else // vector B

Newmask[i] =  (M[i] % HalfNumElts) + HalfNumElts
RKSimon updated this revision to Diff 43391.Dec 21 2015, 12:46 PM
RKSimon edited edge metadata.

Disabled v4f64/v4i64 lowering on AVX2 - VPPERMQ/VPERMPD are faster.

Made the referencing of the half vectors clearer - the code is capable of referencing any 2 of the 4 half vectors, but we only permit the use of the lower halves as there is a definite perf gain there.

delena accepted this revision.Dec 22 2015, 10:56 PM
delena edited edge metadata.


This revision is now accepted and ready to land.Dec 22 2015, 10:56 PM

Thanks Elena

This revision was automatically updated to reflect the committed changes.