This is an archive of the discontinued LLVM Phabricator instance.

[WIP][X86] combineX86ShuffleChain(): canonicalize mask elts picking from splats
AbandonedPublic

Authored by lebedev.ri on Jul 28 2021, 3:35 PM.

Details

Summary

I believe this is correct, given that the splat detection matches the one just above the new code;
however, I'm not quite sure this makes sense overall. Does this make sense to do?

My basic thought is that we might be able to get fewer regressions by handling multiple
insertions of the same value into a vector if we form broadcasts+blends here,
as opposed to D105390, but I have not really thought this through,
and have not tried implementing it yet.

Diff Detail

Event Timeline

lebedev.ri created this revision.Jul 28 2021, 3:35 PM
lebedev.ri requested review of this revision.Jul 28 2021, 3:35 PM

My main question here: does this conceptually make sense, or is it generally wrong to do here?

A few comments

llvm/lib/Target/X86/X86ISelLowering.cpp
35830

Both V1 and V2 could be smaller than RootSizeInBits - we need to either not attempt this in that case, or bail in the loop below if we try to reference an 'identity' element beyond V1's or V2's width.

35831

Why not NewMask.assign(BaseMask.begin(), BaseMask.end())? Then we just need to adjust the masks instead of repeatedly calling emplace_back().

35834

Use isUndefOrInRange ?

35847

This looks over-complicated - merge some of the comments and avoid the nested if()s.

35851

BaseMask = std::move(NewMask);

lebedev.ri marked 4 inline comments as done.

Thanks for taking a look!
Addressing review notes - small refactor.

llvm/lib/Target/X86/X86ISelLowering.cpp
35782–35783

Didn't we just assert that the inputs have the same size as root?

RKSimon added inline comments.Aug 3 2021, 12:20 PM
llvm/lib/Target/X86/X86ISelLowering.cpp
35782–35783

The middle-term plan is to relax these to (RootSizeInBits % VT1.getSizeInBits()) == 0 - we're getting pretty close, and that will let us stop widening in combineX86ShufflesRecursively - which is causing hasOneUse() problems - and just widen when a match is found (which is what CanonicalizeShuffleInput will do). This should also help us get rid of combineX86ShuffleChainWithExtract.

But I can understand if you don't want to handle this yet.

35821

Is this any better?
```
bool IsSplat0 = isTargetShuffleSplat(V0);
bool IsSplat1 = isTargetShuffleSplat(V1);
IsSplat0 &= !!(BaseMaskEltSizeInBits % V0.getScalarValueSizeInBits());
IsSplat1 &= !!(BaseMaskEltSizeInBits % V1.getScalarValueSizeInBits());
```

lebedev.ri marked 2 inline comments as done.

Addressing review feedback: future-proofing.

llvm/lib/Target/X86/X86ISelLowering.cpp
35782–35783

Right. I'm aware that this is an artificial (and subpar) restriction.

35830

Okay, does this look about right?

RKSimon accepted this revision.Aug 4 2021, 6:31 AM

LGTM with a couple of minors - cheers!

llvm/lib/Target/X86/X86ISelLowering.cpp
35800

Add brief description comment

35829

Move this inside the if() ?

This revision is now accepted and ready to land.Aug 4 2021, 6:31 AM
lebedev.ri marked an inline comment as done.Aug 4 2021, 6:39 AM
lebedev.ri added inline comments.
llvm/lib/Target/X86/X86ISelLowering.cpp
35829

This can't work, because BaseMask is an ArrayRef,
so if we narrow the scope of NewMask,
then we can't assign it to BaseMask,
because that would result in a use-after-scope.

Later in the function we have SmallVector<int, 64> Mask; with the same purpose,
so the solution is to change the code in between to use NewMask instead of BaseMask,
but that is a separate change that I will do afterwards.

lebedev.ri updated this revision to Diff 364087.Aug 4 2021, 6:45 AM
lebedev.ri marked 2 inline comments as done.

Thank you for the review!
Adding description to InputNumElts.

This revision was landed with ongoing or failed builds.Aug 4 2021, 6:55 AM
This revision was automatically updated to reflect the committed changes.
RKSimon added inline comments.Aug 4 2021, 7:09 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
35829

OK - got it - cheers

lebedev.ri marked an inline comment as done.Aug 4 2021, 7:16 AM
lebedev.ri added inline comments.
llvm/lib/Target/X86/X86ISelLowering.cpp
35829
lebedev.ri reopened this revision.Aug 5 2021, 11:34 AM
lebedev.ri marked an inline comment as done.

This was reverted in rGbd17ced1db9a674fc8aa6632899e245672c7aa35.

<and so on>

Legalizing: t300: v4i32,ch = X86ISD::VBROADCAST_LOAD<(load (s32) from %ir.input + 48, align 16)> t0, t302
Legal node: nothing to do

Combining: t300: v4i32,ch = X86ISD::VBROADCAST_LOAD<(load (s32) from %ir.input + 48, align 16)> t0, t302

Legalizing: t2574: v4i32 = X86ISD::UNPCKL t250, t300
Legal node: nothing to do

Combining: t2574: v4i32 = X86ISD::UNPCKL t250, t300
Creating constant: t2575: i8 = TargetConstant<10>
Creating new node: t2576: v4i32 = X86ISD::BLENDI t250, t300, TargetConstant:i8<10>
 ... into: t2576: v4i32 = X86ISD::BLENDI t250, t300, TargetConstant:i8<10>

Legalizing: t250: v4i32 = extract_subvector t259, Constant:i64<0>
Legal node: nothing to do

Combining: t250: v4i32 = extract_subvector t259, Constant:i64<0>

Legalizing: t300: v4i32,ch = X86ISD::VBROADCAST_LOAD<(load (s32) from %ir.input + 48, align 16)> t0, t302
Legal node: nothing to do

Combining: t300: v4i32,ch = X86ISD::VBROADCAST_LOAD<(load (s32) from %ir.input + 48, align 16)> t0, t302

Legalizing: t2576: v4i32 = X86ISD::BLENDI t250, t300, TargetConstant:i8<10>
Legal node: nothing to do

Combining: t2576: v4i32 = X86ISD::BLENDI t250, t300, TargetConstant:i8<10>

Legalizing: t2575: i8 = TargetConstant<10>

Combining: t2575: i8 = TargetConstant<10>

Legalizing: t253: v8i32 = insert_subvector undef:v8i32, t2576, Constant:i64<0>
Legal node: nothing to do

Combining: t253: v8i32 = insert_subvector undef:v8i32, t2576, Constant:i64<0>

Legalizing: t295: v4i32 = X86ISD::BLENDI t194, t195, TargetConstant:i8<12>
Legal node: nothing to do

Combining: t295: v4i32 = X86ISD::BLENDI t194, t195, TargetConstant:i8<12>
Creating new node: t2577: v4i32 = X86ISD::UNPCKL t250, t300

Replacing.2 t2576: v4i32 = X86ISD::BLENDI t250, t300, TargetConstant:i8<10>

With: t2577: v4i32 = X86ISD::UNPCKL t250, t300

<and so on>
This revision is now accepted and ready to land.Aug 5 2021, 11:34 AM

Ok, looks like this is a case, traditional for this transform,
where we have two conflicting decisions,
and we keep switching between them.

I'm not quite sure what the traditional approach to dealing with that in combineX86ShuffleChain() is.
I could guard this new transform with the check !(Depth == 0 && Root.getOpcode() == X86ISD::UNPCKL),
and we would then end up with the UNPCKL. Or we could guard the place where we'd match X86ISD::UNPCKL
with the opposite check, and keep the X86ISD::BLENDI. Which is better? What is the right way?

combineX86ShuffleChain is very clear - its highest preference is: unary, then unary+permute, then binary, then binary+permute, then the various unary/unary+zero/binary variable shuffles.

These things tend to occur if some of the elements are constants/SM_SentinelUndef/SM_SentinelZero values and combineX86ShufflesRecursively gets stuck because it loses track of the sentinels at different recursion depths.

Do you have the shuffle masks that are being lowered to the blend / unpack cases?

combineX86ShuffleChain is very clear - its highest preference is: unary, then unary+permute, then binary, then binary+permute, then the various unary/unary+zero/binary variable shuffles.

Aha.
In the case in question we seem to be alternating between two binary shuffle candidates - X86ISD::UNPCKL and X86ISD::BLENDI.

These things tend to occur if some of the elements are constants/SM_SentinelUndef/SM_SentinelZero values and combineX86ShufflesRecursively gets stuck because it loses track of the sentinels at different recursion depths.

Do you have the shuffle masks that are being lowered to the blend / unpack cases?

Yes, from @bkramer, added at rG16605aea844060c2973df079c9f91650132c099d

lebedev.ri abandoned this revision.Aug 6 2021, 8:15 AM

TL;DR: as I suspected, this is conceptually wrong, and should not have been accepted in the first place.
The individual matchers should instead be splat-aware, much like they are already zero/undef-aware.
Indeed, there is some rudimentary support for that in e.g. isTargetShuffleEquivalent()/IsElementEquivalent().

I'm starting to think we might be better off tweaking matchShuffleAsBlend (amongst other matchShuffleAs* methods) to make use of IsElementEquivalent, which does something similar to this patch but does it as part of the shuffle mask matching - if you can extend it to handle broadcasts with different element widths, it should help as much as this patch.

I'm starting to think we might be better off tweaking matchShuffleAsBlend (amongst other matchShuffleAs* methods) to make use of IsElementEquivalent, which does something similar to this patch but does it as part of the shuffle mask matching - if you can extend it to handle broadcasts with different element widths, it should help as much as this patch.

That was precisely my comment, yes.

TL;DR: as I suspected, this is conceptually wrong, and should not have been accepted in the first place.
The individual matchers should instead be splat-aware, much like they are already zero/undef-aware.
Indeed, there is some rudimentary support for that in e.g. isTargetShuffleEquivalent()/IsElementEquivalent().

Didn't see this - glad we think along the same lines :)