This is an archive of the discontinued LLVM Phabricator instance.

[X86] lowerShuffleAsDecomposedShuffleMerge(): if both inputs are broadcastable/identities, canonicalize broadcasts as such
ClosedPublic

Authored by lebedev.ri on Aug 19 2021, 8:44 AM.

Diff Detail

Event Timeline

lebedev.ri created this revision.Aug 19 2021, 8:44 AM
lebedev.ri edited the summary of this revision. (Show Details)Aug 19 2021, 8:55 AM
RKSimon added inline comments.Aug 19 2021, 9:32 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
12665

At least a summary comment would be useful.

12666

Instead of passing Input by reference - why not return it? It just makes it look messy imo.

12672

Would it be better to assert(isBroadcastShuffleMask(InputMask)) ? The isNoopOrBroadcastShuffleMask checks below should ensure it no?

12673

Why not just create a X86ISD::VBROADCAST node? This code is AVX only and we have isel patterns that handle AVX1 cases where load folding fails.

lebedev.ri marked 3 inline comments as done.

@RKSimon thank you for taking a look!
Hopefully addressed review notes.

Note that i'm not quite sure that this is the right way.
This paves road for D108253 that fixes https://bugs.llvm.org/show_bug.cgi?id=50971

llvm/lib/Target/X86/X86ISelLowering.cpp
12666

Note that we also modify InputMask - we turn it into an identity mask.

12672

Sure, that can work now.

12673

Oh, hmm. I have not considered that, and that changes the results somewhat...

Rebased, NFC.

lebedev.ri added inline comments.Aug 19 2021, 2:18 PM
llvm/test/CodeGen/X86/copy-low-subvec-elt-to-high-subvec-elt.ll
287

This regression is being fixed by D108411.

Rebased, NFC.

RKSimon added inline comments.Aug 25 2021, 10:21 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
12667

can this comment reduce to 2 lines? it doesn't seem to be 80col

llvm/test/CodeGen/X86/oddshuffles.ll
2284 ↗(On Diff #368087)

any luck with this?

lebedev.ri added inline comments.Aug 25 2021, 1:30 PM
llvm/lib/Target/X86/X86ISelLowering.cpp
12667

All my commits are clang-formatted, so this did fit within 80-col limit.
Is this better?

Aargh, phab ate my inline comment :/

lebedev.ri added inline comments.Aug 25 2021, 1:35 PM
llvm/test/CodeGen/X86/oddshuffles.ll
2284 ↗(On Diff #368087)

I wrote a comment here, and phab just lost it :(

This seems like demandedelts failure.
In LHS, we successfully dropped this load.
Whenever in SimplifyMultipleUseDemandedBits() we look at this insert_vector_elt,
demandedelts implies that we demand all elements.
The problem is that we need to decode the target shuffle mask to notice that, i think.

Wild guess: perhaps in SimplifyMultipleUseDemandedBitsForTargetNode() after getTargetShuffleInputs(),
we can call SimplifyMultipleUseDemandedBits() on inputs, and recreate the shuffle if that succeeded?
I'm not really sure if there is some other better place to do that.

lebedev.ri added inline comments.Aug 26 2021, 6:23 AM
llvm/test/CodeGen/X86/oddshuffles.ll
2284 ↗(On Diff #368087)

Actually, that won't work either.

Optimized legalized selection DAG: %bb.0 'splat_v3i32:'
SelectionDAG has 32 nodes:
  t0: ch = EntryToken
  t2: i64,ch = CopyFromReg t0, Register:i64 %0
      t24: v8i32 = BUILD_VECTOR Constant:i32<0>, undef:i32, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>
    t55: v8i32 = X86ISD::BLENDI t24, t58, TargetConstant:i8<2>
  t19: ch,glue = CopyToReg t0, Register:v8i32 $ymm0, t55
        t69: v32i8 = bitcast t58
          t76: i64 = X86ISD::Wrapper TargetConstantPool:i64<<32 x i8> <i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 0, i8 1, i8 2, i8 3, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128>> 0
        t74: v32i8,ch = load<(load (s256) from constant-pool)> t0, t76, undef:i64
      t71: v32i8 = X86ISD::PSHUFB t69, t74
    t72: v8i32 = bitcast t71
  t21: ch,glue = CopyToReg t19, Register:v8i32 $ymm1, t72, t19:1
          t27: i64,ch = load<(load (s64) from %ir.ptr, align 1)> t0, t2, undef:i64
        t30: v2i64 = scalar_to_vector t27
      t31: v4i32 = bitcast t30
        t28: i64 = add nuw t2, Constant:i64<8>
      t29: i32,ch = load<(load (s32) from %ir.ptr + 8, align 1)> t0, t28, undef:i64
    t61: v4i32 = insert_vector_elt t31, t29, Constant:i64<2>
  t58: v8i32 = insert_subvector undef:v8i32, t61, Constant:i64<0>
  t22: ch = X86ISD::RET_FLAG t21, TargetConstant:i32<0>, Register:v8i32 $ymm0, Register:v8i32 $ymm1, t21:1


===== Instruction selection begins: %bb.0 ''

t29/t61 is what we want to drop, but even if we could recreate subreg widening, t58 has two uses.
So i guess our only hope is combineX86ShufflesRecursively()?

RKSimon added inline comments.Sep 1 2021, 3:13 AM
llvm/test/CodeGen/X86/oddshuffles.ll
2284 ↗(On Diff #368087)

What might work is calling SimplifyMultipleUseDemandedVectorElts on each operand at the end of the combineX86ShufflesRecursively recursion just before the calls into combineX86ShuffleChain?

lebedev.ri added inline comments.Sep 1 2021, 3:58 AM
llvm/test/CodeGen/X86/oddshuffles.ll
2284 ↗(On Diff #368087)

Let's see...

lebedev.ri added inline comments.Sep 1 2021, 9:48 AM
llvm/test/CodeGen/X86/oddshuffles.ll
2284 ↗(On Diff #368087)

I've locally rebased this patch ontop of the suggestion which i've implemented in D109065. and it does not help.

Rebased ontop of D109074+D109065, the regression is gone.

lebedev.ri updated this revision to Diff 370078.Sep 1 2021, 3:03 PM

Rebased, NFC.

RKSimon accepted this revision.Sep 7 2021, 2:43 PM

LGTM - cheers

This revision is now accepted and ready to land.Sep 7 2021, 2:43 PM

LGTM - cheers

Aha! Thank you for the review.
This depends on D109065, so these two accepted patches in this patch series can't land just yet.

Rebased, NFC.
Going to land this now.

This revision was landed with ongoing or failed builds.Sep 19 2021, 7:36 AM
This revision was automatically updated to reflect the committed changes.