This was suggested in https://reviews.llvm.org/D108382#inline-1039018,
maybe hopefully this will help avoid regressions in that patch.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
| llvm/lib/Target/X86/X86ISelLowering.cpp | ||
|---|---|---|
| 37920 | Op might be a different width to the Root - see the "Widen any subvector shuffle inputs we've collected." code below. | |
| llvm/lib/Target/X86/X86ISelLowering.cpp | ||
|---|---|---|
| 37920 | I keep hitting the same pitfail. | |
| llvm/lib/Target/X86/X86ISelLowering.cpp | ||
|---|---|---|
| 37920 | We still need to do this before the widenSubVector() code - otherwise we'll never be able to simplify any input that doesn't match RootSizeInBits, which are likely to be the most interesting cases imo. | |
| 37972 | This seems to be really bulky for what its actually doing. I don't think we need to create this shuffle mask for instance, we should be able to create a demanded elts mask directly and then trunc/scale it for the input's size. I keep meaning to create a scaleDemandedMask() common helper method as we have several places that would use it (e.g. SelectionDAG.computeKnownBits bitcast handling and other parts of value tracking). | |
| llvm/lib/Target/X86/X86ISelLowering.cpp | ||
|---|---|---|
| 37972 | That is what what i initially came up with, and it's *much* uglier than this code :) | |
Try to move before widenSubVector() - now without miscompiles?
| llvm/lib/Target/X86/X86ISelLowering.cpp | ||
|---|---|---|
| 37603 | Let's do that afterwards? | |
| 37981 | Ok, i admit i've tried to avoid doing that because i don't quite understand all of the logic here. | |
| llvm/test/CodeGen/X86/insertelement-ones.ll | ||
|---|---|---|
| 389 ↗ | (On Diff #371316) | Any luck on improving this? | 
| llvm/test/CodeGen/X86/insertelement-ones.ll | ||
|---|---|---|
| 389 ↗ | (On Diff #371316) | This one is obscure. We have: mask:  0 1 2 3 4 5 6 7 8 9 10 11 12 13 30 -2
matchBinaryShuffle()
EltSizeInBits: 8
V1:
t4: v16i8,ch = CopyFromReg t0, Register:v16i8 %1
  t3: v16i8 = Register %1
V2:
t74: v16i8 = X86ISD::VSHLDQ t51, TargetConstant:i8<14>
  t51: v16i8 = bitcast t50
    t50: v4i32 = scalar_to_vector Constant:i32<255>
      t49: i32 = Constant<255>
  t73: i8 = TargetConstant<14>We can't say anything about t4, but i think it's obvious that t74 is actually I think we'd basically have to do computeKnownBits() for each element of V1/V2 separately. Should i keep looking? | 
Rebased ontop main+D109726.
The noted regression is gone (but many more took it's place.)
| llvm/test/CodeGen/X86/insertelement-ones.ll | ||
|---|---|---|
| 311 ↗ | (On Diff #373251) | Here we have: Optimized legalized selection DAG: %bb.0 'insert_v16i8_x123456789ABCDEx:'
SelectionDAG has 20 nodes:
  t0: ch = EntryToken
          t2: v16i8,ch = CopyFromReg t0, Register:v16i8 %0
        t19: v16i8 = and t2, t36
        t20: v16i8 = X86ISD::ANDNP t36, t27
      t21: v16i8 = or t19, t20
      t33: v16i8 = X86ISD::VSHLDQ t27, TargetConstant:i8<15>
    t45: v16i8 = or t21, t33
  t12: ch,glue = CopyToReg t0, Register:v16i8 $xmm0, t45
    t26: v4i32 = scalar_to_vector Constant:i32<255>
  t27: v16i8 = bitcast t26
    t38: i64 = X86ISD::Wrapper TargetConstantPool:i64<<16 x i8> <i8 0, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1>> 0
  t36: v16i8,ch = load<(load (s128) from constant-pool)> t0, t38, undef:i64
  t13: ch = X86ISD::RET_FLAG t12, TargetConstant:i32<0>, Register:v16i8 $xmm0, t12:1... so matchBinaryShuffle() again fails to omit the masking, | 
| llvm/test/CodeGen/X86/insertelement-ones.ll | ||
|---|---|---|
| 315 ↗ | (On Diff #373251) | We're going to have to improve INSERT_VECTOR_ELT handling of 0/-1 elements - just AND/OR if we don't have a legal PINSRB instruction (pre-SSE41). | 
| llvm/test/CodeGen/X86/oddshuffles.ll | ||
| 2268 | Looks like we're missing a fold to share scalar_to_vector(x) and scalar_to_vector(trunc(x)) (maybe worth supporting scalar_to_vector(ext(x)) as well)? | |
| llvm/test/CodeGen/X86/insertelement-ones.ll | ||
|---|---|---|
| 315 ↗ | (On Diff #373251) | It looks like we might be able to do this more easily by extending lowerShuffleAsBitMask to handle the allones elements case as well as the zero elements case. | 
| llvm/test/CodeGen/X86/insertelement-ones.ll | ||
|---|---|---|
| 315 ↗ | (On Diff #373251) | Note that X86TargetLowering::LowerINSERT_VECTOR_ELT isn't even called for this test, | 
Rebased ontop of D109989 - llvm/test/CodeGen/X86/insertelement-ones.ll is all good now.
| llvm/test/CodeGen/X86/oddshuffles.ll | ||
|---|---|---|
| 2268 | This stuff is broken. I suppose we could look past ext/trunc of scalar_to_vector operand, Optimized legalized selection DAG: %bb.0 'splat_v3i32:'
SelectionDAG has 28 nodes:
  t0: ch = EntryToken
    t2: i64,ch = CopyFromReg t0, Register:i64 %0
  t27: i64,ch = load<(load (s64) from %ir.ptr, align 1)> t0, t2, undef:i64
      t24: v8i32 = BUILD_VECTOR Constant:i32<0>, undef:i32, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>
          t30: v2i64 = scalar_to_vector t27
        t107: v4i64 = insert_subvector undef:v4i64, t30, Constant:i64<0>
      t108: v8i32 = bitcast t107
    t101: v8i32 = X86ISD::BLENDI t24, t108, TargetConstant:i8<2>
  t19: ch,glue = CopyToReg t0, Register:v8i32 $ymm0, t101
      t25: v8i32 = BUILD_VECTOR Constant:i32<0>, Constant:i32<0>, undef:i32, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>
          t91: i32 = truncate t27
        t92: v4i32 = X86ISD::VBROADCAST t91
      t94: v8i32 = insert_subvector undef:v8i32, t92, Constant:i64<0>
    t97: v8i32 = X86ISD::BLENDI t25, t94, TargetConstant:i8<4>
  t21: ch,glue = CopyToReg t19, Register:v8i32 $ymm1, t97, t19:1
  t22: ch = X86ISD::RET_FLAG t21, TargetConstant:i32<0>, Register:v8i32 $ymm0, Register:v8i32 $ymm1, t21:1 | |
| llvm/test/CodeGen/X86/oddshuffles.ll | ||
|---|---|---|
| 2268 | While something like this should work: diff --git a/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp b/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp index 5a49f33e46fe..4d7c2c2a8651 100644 --- a/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp +++ b/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp @@ -21824,6 +21824,13 @@ SDValue DAGCombiner::visitSCALAR_TO_VECTOR(SDNode *N) { } } + // Fold SCALAR_TO_VECTOR(TRUNCATE(V)) to SCALAR_TO_VECTOR(V), + // by making trucation of the operand implicit. + if (InVal.getOpcode() == ISD::TRUNCATE && VT.isFixedLengthVector() && + Level < AfterLegalizeDAG) + return DAG.getNode(ISD::SCALAR_TO_VECTOR, SDLoc(N), VT, + InVal->getOperand(0)); + return SDValue(); } diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp index 09ba7af6e38a..695cc8303cc1 100644 --- a/llvm/lib/Target/X86/X86ISelLowering.cpp +++ b/llvm/lib/Target/X86/X86ISelLowering.cpp @@ -14076,10 +14076,12 @@ static SDValue lowerShuffleAsBroadcast(const SDLoc &DL, MVT VT, SDValue V1, // If this is a scalar, do the broadcast on this type and bitcast. if (!V.getValueType().isVector()) { - assert(V.getScalarValueSizeInBits() == NumEltBits && - "Unexpected scalar size"); - MVT BroadcastVT = MVT::getVectorVT(V.getSimpleValueType(), - VT.getVectorNumElements()); + if(V.getValueType().isInteger() && + V.getScalarValueSizeInBits() > NumEltBits) + V = DAG.getNode(ISD::TRUNCATE, DL, VT.getScalarType(), V); + assert(V.getScalarValueSizeInBits() == NumEltBits && "Unexpected scalar size"); + MVT BroadcastVT = + MVT::getVectorVT(V.getSimpleValueType(), VT.getVectorNumElements()); return DAG.getBitcast(VT, DAG.getNode(Opcode, DL, BroadcastVT, V)); } it doesn't catch anything with the cut-off, | |
| llvm/test/CodeGen/X86/oddshuffles.ll | ||
|---|---|---|
| 2268 | I've looked again, and i'm not sure i have enough motivation to tackle all the fallout from the | |
This can probably move to the APIntOps helpers