This was suggested in https://reviews.llvm.org/D108382#inline-1039018,
maybe hopefully this will help avoid regressions in that patch.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
llvm/lib/Target/X86/X86ISelLowering.cpp | ||
---|---|---|
37911 | Op might be a different width to the Root - see the "Widen any subvector shuffle inputs we've collected." code below. |
llvm/lib/Target/X86/X86ISelLowering.cpp | ||
---|---|---|
37911 | I keep hitting the same pitfail. |
llvm/lib/Target/X86/X86ISelLowering.cpp | ||
---|---|---|
37911 | We still need to do this before the widenSubVector() code - otherwise we'll never be able to simplify any input that doesn't match RootSizeInBits, which are likely to be the most interesting cases imo. | |
37965 | This seems to be really bulky for what its actually doing. I don't think we need to create this shuffle mask for instance, we should be able to create a demanded elts mask directly and then trunc/scale it for the input's size. I keep meaning to create a scaleDemandedMask() common helper method as we have several places that would use it (e.g. SelectionDAG.computeKnownBits bitcast handling and other parts of value tracking). |
llvm/lib/Target/X86/X86ISelLowering.cpp | ||
---|---|---|
37965 | That is what what i initially came up with, and it's *much* uglier than this code :) |
Try to move before widenSubVector() - now without miscompiles?
llvm/lib/Target/X86/X86ISelLowering.cpp | ||
---|---|---|
37594 | Let's do that afterwards? | |
37974 | Ok, i admit i've tried to avoid doing that because i don't quite understand all of the logic here. |
llvm/test/CodeGen/X86/insertelement-ones.ll | ||
---|---|---|
389 | Any luck on improving this? |
llvm/test/CodeGen/X86/insertelement-ones.ll | ||
---|---|---|
389 | This one is obscure. We have: mask: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 30 -2 matchBinaryShuffle() EltSizeInBits: 8 V1: t4: v16i8,ch = CopyFromReg t0, Register:v16i8 %1 t3: v16i8 = Register %1 V2: t74: v16i8 = X86ISD::VSHLDQ t51, TargetConstant:i8<14> t51: v16i8 = bitcast t50 t50: v4i32 = scalar_to_vector Constant:i32<255> t49: i32 = Constant<255> t73: i8 = TargetConstant<14> We can't say anything about t4, but i think it's obvious that t74 is actually I think we'd basically have to do computeKnownBits() for each element of V1/V2 separately. Should i keep looking? |
Rebased ontop main+D109726.
The noted regression is gone (but many more took it's place.)
llvm/test/CodeGen/X86/insertelement-ones.ll | ||
---|---|---|
314 | Here we have: Optimized legalized selection DAG: %bb.0 'insert_v16i8_x123456789ABCDEx:' SelectionDAG has 20 nodes: t0: ch = EntryToken t2: v16i8,ch = CopyFromReg t0, Register:v16i8 %0 t19: v16i8 = and t2, t36 t20: v16i8 = X86ISD::ANDNP t36, t27 t21: v16i8 = or t19, t20 t33: v16i8 = X86ISD::VSHLDQ t27, TargetConstant:i8<15> t45: v16i8 = or t21, t33 t12: ch,glue = CopyToReg t0, Register:v16i8 $xmm0, t45 t26: v4i32 = scalar_to_vector Constant:i32<255> t27: v16i8 = bitcast t26 t38: i64 = X86ISD::Wrapper TargetConstantPool:i64<<16 x i8> <i8 0, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1>> 0 t36: v16i8,ch = load<(load (s128) from constant-pool)> t0, t38, undef:i64 t13: ch = X86ISD::RET_FLAG t12, TargetConstant:i32<0>, Register:v16i8 $xmm0, t12:1 ... so matchBinaryShuffle() again fails to omit the masking, |
llvm/test/CodeGen/X86/insertelement-ones.ll | ||
---|---|---|
318 | We're going to have to improve INSERT_VECTOR_ELT handling of 0/-1 elements - just AND/OR if we don't have a legal PINSRB instruction (pre-SSE41). | |
llvm/test/CodeGen/X86/oddshuffles.ll | ||
2268 | Looks like we're missing a fold to share scalar_to_vector(x) and scalar_to_vector(trunc(x)) (maybe worth supporting scalar_to_vector(ext(x)) as well)? |
llvm/test/CodeGen/X86/insertelement-ones.ll | ||
---|---|---|
318 | It looks like we might be able to do this more easily by extending lowerShuffleAsBitMask to handle the allones elements case as well as the zero elements case. |
llvm/test/CodeGen/X86/insertelement-ones.ll | ||
---|---|---|
318 | Note that X86TargetLowering::LowerINSERT_VECTOR_ELT isn't even called for this test, |
Rebased ontop of D109989 - llvm/test/CodeGen/X86/insertelement-ones.ll is all good now.
llvm/test/CodeGen/X86/oddshuffles.ll | ||
---|---|---|
2268 | This stuff is broken. I suppose we could look past ext/trunc of scalar_to_vector operand, Optimized legalized selection DAG: %bb.0 'splat_v3i32:' SelectionDAG has 28 nodes: t0: ch = EntryToken t2: i64,ch = CopyFromReg t0, Register:i64 %0 t27: i64,ch = load<(load (s64) from %ir.ptr, align 1)> t0, t2, undef:i64 t24: v8i32 = BUILD_VECTOR Constant:i32<0>, undef:i32, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0> t30: v2i64 = scalar_to_vector t27 t107: v4i64 = insert_subvector undef:v4i64, t30, Constant:i64<0> t108: v8i32 = bitcast t107 t101: v8i32 = X86ISD::BLENDI t24, t108, TargetConstant:i8<2> t19: ch,glue = CopyToReg t0, Register:v8i32 $ymm0, t101 t25: v8i32 = BUILD_VECTOR Constant:i32<0>, Constant:i32<0>, undef:i32, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0> t91: i32 = truncate t27 t92: v4i32 = X86ISD::VBROADCAST t91 t94: v8i32 = insert_subvector undef:v8i32, t92, Constant:i64<0> t97: v8i32 = X86ISD::BLENDI t25, t94, TargetConstant:i8<4> t21: ch,glue = CopyToReg t19, Register:v8i32 $ymm1, t97, t19:1 t22: ch = X86ISD::RET_FLAG t21, TargetConstant:i32<0>, Register:v8i32 $ymm0, Register:v8i32 $ymm1, t21:1 |
llvm/test/CodeGen/X86/oddshuffles.ll | ||
---|---|---|
2268 | While something like this should work: diff --git a/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp b/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp index 5a49f33e46fe..4d7c2c2a8651 100644 --- a/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp +++ b/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp @@ -21824,6 +21824,13 @@ SDValue DAGCombiner::visitSCALAR_TO_VECTOR(SDNode *N) { } } + // Fold SCALAR_TO_VECTOR(TRUNCATE(V)) to SCALAR_TO_VECTOR(V), + // by making trucation of the operand implicit. + if (InVal.getOpcode() == ISD::TRUNCATE && VT.isFixedLengthVector() && + Level < AfterLegalizeDAG) + return DAG.getNode(ISD::SCALAR_TO_VECTOR, SDLoc(N), VT, + InVal->getOperand(0)); + return SDValue(); } diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp index 09ba7af6e38a..695cc8303cc1 100644 --- a/llvm/lib/Target/X86/X86ISelLowering.cpp +++ b/llvm/lib/Target/X86/X86ISelLowering.cpp @@ -14076,10 +14076,12 @@ static SDValue lowerShuffleAsBroadcast(const SDLoc &DL, MVT VT, SDValue V1, // If this is a scalar, do the broadcast on this type and bitcast. if (!V.getValueType().isVector()) { - assert(V.getScalarValueSizeInBits() == NumEltBits && - "Unexpected scalar size"); - MVT BroadcastVT = MVT::getVectorVT(V.getSimpleValueType(), - VT.getVectorNumElements()); + if(V.getValueType().isInteger() && + V.getScalarValueSizeInBits() > NumEltBits) + V = DAG.getNode(ISD::TRUNCATE, DL, VT.getScalarType(), V); + assert(V.getScalarValueSizeInBits() == NumEltBits && "Unexpected scalar size"); + MVT BroadcastVT = + MVT::getVectorVT(V.getSimpleValueType(), VT.getVectorNumElements()); return DAG.getBitcast(VT, DAG.getNode(Opcode, DL, BroadcastVT, V)); } it doesn't catch anything with the cut-off, |
llvm/test/CodeGen/X86/oddshuffles.ll | ||
---|---|---|
2268 | I've looked again, and i'm not sure i have enough motivation to tackle all the fallout from the |
This can probably move to the APIntOps helpers