This is an archive of the discontinued LLVM Phabricator instance.

[X86] combineX86ShufflesRecursively(): call SimplifyMultipleUseDemandedVectorElts() on after finishing recursing
ClosedPublic

Authored by lebedev.ri on Sep 1 2021, 8:01 AM.

Details

Summary

This was suggested in https://reviews.llvm.org/D108382#inline-1039018,
maybe hopefully this will help avoid regressions in that patch.

Diff Detail

Event Timeline

lebedev.ri created this revision.Sep 1 2021, 8:01 AM
lebedev.ri requested review of this revision.Sep 1 2021, 8:01 AM
RKSimon added inline comments.Sep 1 2021, 8:48 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
37920

Op might be a different width to the Root - see the "Widen any subvector shuffle inputs we've collected." code below.

lebedev.ri updated this revision to Diff 369961.Sep 1 2021, 9:45 AM
lebedev.ri marked an inline comment as done.

Move the code after widening.
TBN, i don't believe this helps D108382.

Rebased ontop of D109074, NFC.

lebedev.ri added inline comments.Sep 1 2021, 11:09 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
37920

I keep hitting the same pitfail.

lebedev.ri updated this revision to Diff 370077.Sep 1 2021, 3:00 PM

Rebased, NFC.

RKSimon added inline comments.Sep 7 2021, 3:16 PM
llvm/lib/Target/X86/X86ISelLowering.cpp
37920

We still need to do this before the widenSubVector() code - otherwise we'll never be able to simplify any input that doesn't match RootSizeInBits, which are likely to be the most interesting cases imo.

37972

This seems to be really bulky for what its actually doing. I don't think we need to create this shuffle mask for instance, we should be able to create a demanded elts mask directly and then trunc/scale it for the input's size.

I keep meaning to create a scaleDemandedMask() common helper method as we have several places that would use it (e.g. SelectionDAG.computeKnownBits bitcast handling and other parts of value tracking).

lebedev.ri added inline comments.Sep 7 2021, 3:20 PM
llvm/lib/Target/X86/X86ISelLowering.cpp
37972

That is what what i initially came up with, and it's *much* uglier than this code :)
I can do that again, but i'm not sure that will be be better.

lebedev.ri updated this revision to Diff 371298.Sep 8 2021, 3:07 AM
lebedev.ri marked an inline comment as not done.

Introduce ScaleDemandedEltsMask() and use it.

llvm/lib/Target/X86/X86ISelLowering.cpp
37920

I agree, but is this a correctness concern for this patch?

37972

Ok, how about this?

RKSimon added inline comments.Sep 8 2021, 3:27 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
37607

We should assert that (NumElts % NumSrcElts) == 0 || (NumSrcElts % NumElts) == 0 - or return true/false on success/failure.

37920

what correctness?

lebedev.ri added inline comments.Sep 8 2021, 3:29 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
37607

Oops, i meant to do that, but forgot to in the end.

37920

I mean, if we don't do this in this patch, will that lead to miscompiles, or simply to missed optimizations?

lebedev.ri updated this revision to Diff 371303.Sep 8 2021, 3:32 AM
lebedev.ri marked an inline comment as done.

Add forgotten assert.

RKSimon added inline comments.Sep 8 2021, 3:52 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
37603

This can probably move to the APIntOps helpers

37981

To move this before widening, we should just need to truncate OpDemandedElts based on its size vs RootSizeInBits - we should assert that no lost elts were demanded. Then we can scale it.

lebedev.ri updated this revision to Diff 371316.Sep 8 2021, 4:56 AM

Try to move before widenSubVector() - now without miscompiles?

llvm/lib/Target/X86/X86ISelLowering.cpp
37603

Let's do that afterwards?

37981

Ok, i admit i've tried to avoid doing that because i don't quite understand all of the logic here.
Does this look right? It avoids the miscompiles that were visible in some previous attempt at least.

Did i get it right this time? :)

RKSimon added inline comments.Sep 13 2021, 3:21 AM
llvm/test/CodeGen/X86/insertelement-ones.ll
389 ↗(On Diff #371316)

Any luck on improving this?

lebedev.ri added inline comments.Sep 13 2021, 1:44 PM
llvm/test/CodeGen/X86/insertelement-ones.ll
389 ↗(On Diff #371316)

This one is obscure.
I believe the problem is X86ISelLowering.cpp's matchBinaryShuffle()'s ISD::OR lowering.

We have:

mask:  0 1 2 3 4 5 6 7 8 9 10 11 12 13 30 -2

matchBinaryShuffle()
EltSizeInBits: 8
V1:
t4: v16i8,ch = CopyFromReg t0, Register:v16i8 %1
  t3: v16i8 = Register %1
V2:
t74: v16i8 = X86ISD::VSHLDQ t51, TargetConstant:i8<14>
  t51: v16i8 = bitcast t50
    t50: v4i32 = scalar_to_vector Constant:i32<255>
      t49: i32 = Constant<255>
  t73: i8 = TargetConstant<14>

We can't say anything about t4, but i think it's obvious that t74 is actually
an all-zeros except the 14'th element, which is all-ones.
So we of course can lower that as an or blend, and we do not care what t4 is.
But the code fails to do that.

I think we'd basically have to do computeKnownBits() for each element of V1/V2 separately.

Should i keep looking?

lebedev.ri added inline comments.
llvm/test/CodeGen/X86/insertelement-ones.ll
389 ↗(On Diff #371316)

Ok, got it: D109726

lebedev.ri marked an inline comment as done.

Rebased ontop main+D109726.
The noted regression is gone (but many more took it's place.)

Rebased, NFC.

lebedev.ri added inline comments.Sep 17 2021, 9:30 AM
llvm/test/CodeGen/X86/insertelement-ones.ll
311 ↗(On Diff #373251)

Here we have:

Optimized legalized selection DAG: %bb.0 'insert_v16i8_x123456789ABCDEx:'
SelectionDAG has 20 nodes:
  t0: ch = EntryToken
          t2: v16i8,ch = CopyFromReg t0, Register:v16i8 %0
        t19: v16i8 = and t2, t36
        t20: v16i8 = X86ISD::ANDNP t36, t27
      t21: v16i8 = or t19, t20
      t33: v16i8 = X86ISD::VSHLDQ t27, TargetConstant:i8<15>
    t45: v16i8 = or t21, t33
  t12: ch,glue = CopyToReg t0, Register:v16i8 $xmm0, t45
    t26: v4i32 = scalar_to_vector Constant:i32<255>
  t27: v16i8 = bitcast t26
    t38: i64 = X86ISD::Wrapper TargetConstantPool:i64<<16 x i8> <i8 0, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1>> 0
  t36: v16i8,ch = load<(load (s128) from constant-pool)> t0, t38, undef:i64
  t13: ch = X86ISD::RET_FLAG t12, TargetConstant:i32<0>, Register:v16i8 $xmm0, t12:1

... so matchBinaryShuffle() again fails to omit the masking,
even though it's obviously redundant here for the reasons seen in D109726.
I would suspect that is because around scalar_to_vector we operate on i32 elt type,
so we don't have all-ones elements until after bitcast.
Without changing computeKnownBits to operate on a specified element width,
i'm not sure it can help us further, and that does not sound like the right fix.

RKSimon added inline comments.Sep 17 2021, 9:30 AM
llvm/test/CodeGen/X86/insertelement-ones.ll
315 ↗(On Diff #373251)

We're going to have to improve INSERT_VECTOR_ELT handling of 0/-1 elements - just AND/OR if we don't have a legal PINSRB instruction (pre-SSE41).

llvm/test/CodeGen/X86/oddshuffles.ll
2268

Looks like we're missing a fold to share scalar_to_vector(x) and scalar_to_vector(trunc(x)) (maybe worth supporting scalar_to_vector(ext(x)) as well)?

RKSimon added inline comments.Sep 17 2021, 10:53 AM
llvm/test/CodeGen/X86/insertelement-ones.ll
315 ↗(On Diff #373251)

It looks like we might be able to do this more easily by extending lowerShuffleAsBitMask to handle the allones elements case as well as the zero elements case.

lebedev.ri added inline comments.Sep 17 2021, 11:05 AM
llvm/test/CodeGen/X86/insertelement-ones.ll
315 ↗(On Diff #373251)

Note that X86TargetLowering::LowerINSERT_VECTOR_ELT isn't even called for this test,
since we expand, not legalize, in this case.
Marking it as legalize causes crashes "don't know how to legalize",
i guess it doesn't retry to legalize via the generic expansion.

lebedev.ri marked 2 inline comments as done.

Rebased ontop of D109989 - llvm/test/CodeGen/X86/insertelement-ones.ll is all good now.

lebedev.ri added inline comments.
llvm/test/CodeGen/X86/insertelement-ones.ll
315 ↗(On Diff #373251)

Done, D109989.

lebedev.ri added inline comments.Sep 17 2021, 11:52 AM
llvm/test/CodeGen/X86/oddshuffles.ll
2268

This stuff is broken.
We've in AVX1-more, and only have broadcast-from-mem,
yet we've successfully obscured the load via the truncation.

I suppose we could look past ext/trunc of scalar_to_vector operand,
and change it to bitcast/ext of scalar_to_vector itself,
let me see.

Optimized legalized selection DAG: %bb.0 'splat_v3i32:'
SelectionDAG has 28 nodes:
  t0: ch = EntryToken
    t2: i64,ch = CopyFromReg t0, Register:i64 %0
  t27: i64,ch = load<(load (s64) from %ir.ptr, align 1)> t0, t2, undef:i64
      t24: v8i32 = BUILD_VECTOR Constant:i32<0>, undef:i32, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>
          t30: v2i64 = scalar_to_vector t27
        t107: v4i64 = insert_subvector undef:v4i64, t30, Constant:i64<0>
      t108: v8i32 = bitcast t107
    t101: v8i32 = X86ISD::BLENDI t24, t108, TargetConstant:i8<2>
  t19: ch,glue = CopyToReg t0, Register:v8i32 $ymm0, t101
      t25: v8i32 = BUILD_VECTOR Constant:i32<0>, Constant:i32<0>, undef:i32, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>
          t91: i32 = truncate t27
        t92: v4i32 = X86ISD::VBROADCAST t91
      t94: v8i32 = insert_subvector undef:v8i32, t92, Constant:i64<0>
    t97: v8i32 = X86ISD::BLENDI t25, t94, TargetConstant:i8<4>
  t21: ch,glue = CopyToReg t19, Register:v8i32 $ymm1, t97, t19:1
  t22: ch = X86ISD::RET_FLAG t21, TargetConstant:i32<0>, Register:v8i32 $ymm0, Register:v8i32 $ymm1, t21:1
lebedev.ri added inline comments.Sep 17 2021, 12:40 PM
llvm/test/CodeGen/X86/oddshuffles.ll
2268

While something like this should work:

diff --git a/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp b/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
index 5a49f33e46fe..4d7c2c2a8651 100644
--- a/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
@@ -21824,6 +21824,13 @@ SDValue DAGCombiner::visitSCALAR_TO_VECTOR(SDNode *N) {
     }
   }
 
+  // Fold SCALAR_TO_VECTOR(TRUNCATE(V)) to SCALAR_TO_VECTOR(V),
+  // by making trucation of the operand implicit.
+  if (InVal.getOpcode() == ISD::TRUNCATE && VT.isFixedLengthVector() &&
+      Level < AfterLegalizeDAG)
+    return DAG.getNode(ISD::SCALAR_TO_VECTOR, SDLoc(N), VT,
+                       InVal->getOperand(0));
+
   return SDValue();
 }
 
diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp
index 09ba7af6e38a..695cc8303cc1 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.cpp
+++ b/llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -14076,10 +14076,12 @@ static SDValue lowerShuffleAsBroadcast(const SDLoc &DL, MVT VT, SDValue V1,
 
   // If this is a scalar, do the broadcast on this type and bitcast.
   if (!V.getValueType().isVector()) {
-    assert(V.getScalarValueSizeInBits() == NumEltBits &&
-           "Unexpected scalar size");
-    MVT BroadcastVT = MVT::getVectorVT(V.getSimpleValueType(),
-                                       VT.getVectorNumElements());
+    if(V.getValueType().isInteger() &&
+       V.getScalarValueSizeInBits() > NumEltBits)
+      V = DAG.getNode(ISD::TRUNCATE, DL, VT.getScalarType(), V);
+    assert(V.getScalarValueSizeInBits() == NumEltBits && "Unexpected scalar size");
+    MVT BroadcastVT =
+        MVT::getVectorVT(V.getSimpleValueType(), VT.getVectorNumElements());
     return DAG.getBitcast(VT, DAG.getNode(Opcode, DL, BroadcastVT, V));
   }

it doesn't catch anything with the cut-off,
and without it. it exposes numerous places that don't expect this truncation to be implicit.

lebedev.ri added inline comments.Sep 18 2021, 3:22 AM
llvm/test/CodeGen/X86/oddshuffles.ll
2268

I've looked again, and i'm not sure i have enough motivation to tackle all the fallout from the
scalar_to_vector(trunc(x)) --> scalar_to_vector(x) fold, unless i misunderstood the suggestion.

Rebased, NFC.

RKSimon accepted this revision.Sep 19 2021, 5:00 AM

LGTM

llvm/lib/Target/X86/X86ISelLowering.cpp
37915

Lo + Hi

llvm/test/CodeGen/X86/oddshuffles.ll
2268

That's OK - I'll take a look when I get the chance.

This revision is now accepted and ready to land.Sep 19 2021, 5:00 AM
lebedev.ri marked 2 inline comments as done.Sep 19 2021, 7:19 AM

LGTM

:/
Thank you for the review!

llvm/test/CodeGen/X86/oddshuffles.ll
2268

Sorry :/

This revision was landed with ongoing or failed builds.Sep 19 2021, 7:25 AM
This revision was automatically updated to reflect the committed changes.
lebedev.ri marked an inline comment as done.