Page MenuHomePhabricator

[X86, AVX] instcombine common cases of vperm2* intrinsics into shuffles
ClosedPublic

Authored by spatel on Mar 20 2015, 10:23 AM.

Details

Summary

vperm2* intrinsics are just shuffles unless a zero mask bit is set. In a few special cases, they're not even shuffles.

Optimizing intrinsics in InstCombine is better than handling this in the front-end for at least two reasons:

  1. Optimizing custom-written SSE intrinsic code at -O0 makes vector coders really angry (and so I have some regrets about some patches from last week).
  2. Doing mask conversion logic in header files is hard to write and subsequently read.

Unfortunately, we use a magic number (generally assumed to be -1) to specify undef values in shufflevector masks in IR. And apparently, that magic has led to lax coding where we just check <0 for undef. If we had a proper enum for shufflevector mask special values, we could do like the x86 backend has done and easily transform the zero mask bit cases here too. Fixing that could be a follow-on patch. Otherwise, we'll try to deal with matching a 2-shuffle sequence in the x86 backend. But again, that's a separate patch (see the TODO comment in this one).

Diff Detail

Repository
rL LLVM

Event Timeline

spatel updated this revision to Diff 22354.Mar 20 2015, 10:23 AM
spatel retitled this revision from to [X86, AVX] instcombine common cases of vperm2* intrinsics into shuffles.
spatel updated this object.
spatel edited the test plan for this revision. (Show Details)
spatel added a subscriber: Unknown Object (MLST).
andreadb edited edge metadata.Mar 20 2015, 12:01 PM

Hi Sanjay,

lib/Transforms/InstCombine/InstCombineCalls.cpp
203 ↗(On Diff #22354)

Why not have a static function instead? If you use a function, you can avoid to to change 'InstCombineInternal.h'.

245–248 ↗(On Diff #22354)

I don't think you need two shuffles for this case.
If either b[3] or b[7] is set, you can always convert the 'vperm2f128' into a single shuffle instruction where one of the operands is a zero vector, and the other operand is either Op0 or Op2.

Example:
With mask 'Imm' equal to '0x08', the low 128-bit lane of the result is always going to be zero.
The upper 128-bit lane is goint to be Op0[127:0]. So, you can expand the input 'x86_avx_vperm2f128_ps_256' into:

shufflevector <8 x float> zeroinitializer, <8 x float> Op0, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 8, i32 9, i32 10, i32 11>

Basically, what I am trying to say is that if a 'zero bit' is set, then one of the input vectors becomes redundant.

You can reuse the same logic you added for the case where 'zero mask bits are not set' to handle _all_ the possible cases. You would only need to check in advance if zero bits are set, and replace either Op0 or Op1 (or both) with a zero vector if needed.

spatel added inline comments.Mar 20 2015, 12:45 PM
lib/Transforms/InstCombine/InstCombineCalls.cpp
203 ↗(On Diff #22354)

No reason other than cut and paste from the code around this. If this is static, we'd have to pass in a reference to the InstCombiner in order to use ReplaceInstUsesWith(), right? Do you think that's cleaner?

245–248 ↗(On Diff #22354)

Ah, nice catch. I'll adjust the comment and make that a small follow-on patch if it's ok with you.

spatel added inline comments.Mar 20 2015, 12:50 PM
lib/Transforms/InstCombine/InstCombineCalls.cpp
203 ↗(On Diff #22354)

I think the use of 'Builder' would also require the InstCombiner reference.

andreadb added inline comments.Mar 20 2015, 1:00 PM
lib/Transforms/InstCombine/InstCombineCalls.cpp
203 ↗(On Diff #22354)

I was thinking about something like this:

if (Value *V = SimplifyX86vperm2(*II, Builder))
 return ReplaceInstUsesWith(II, V);

Basically, the call to 'ReplaceInstUsesWith' could be done by the caller rather than by function 'SimplifyX86vperm2'. Your function can then take as second argument a pointer to the IRBuilder.
So, something like:

static Value *simplifyX86Vperm2f128(IntrinsicInst &II, InstCombiner::BuilderTy *Builder) {...}

That said, this was just an idea. I don't know if this is a better/cleaner way for doing it.. but at least (I thought) we avoid to change the header file. :-)

spatel added inline comments.Mar 20 2015, 1:18 PM
lib/Transforms/InstCombine/InstCombineCalls.cpp
203 ↗(On Diff #22354)

Ok, as long as I'm adding the minimal number of lines to this already 900 line switch statement. :)

I'll update the patch shortly.

spatel updated this revision to Diff 22373.Mar 20 2015, 1:31 PM
spatel edited edge metadata.

Patch updated based on feedback from Andrea:

  1. Made the simplify function static - avoids header changes w/o adding any extra lines to the switch
  2. Fixed comment regarding the zero mask case - will add that in a future patch
  3. Moved some local variables closer to their usage
  4. Added TODO for the equivalent AVX2 (integer) intrinsic
andreadb accepted this revision.Mar 20 2015, 2:21 PM
andreadb edited edge metadata.

LGTM. Thanks Sanjay.

This revision is now accepted and ready to land.Mar 20 2015, 2:21 PM
This revision was automatically updated to reflect the committed changes.