Differential D104156

[DAGCombine][X86][ARM] EXTRACT_SUBVECTOR(VECTOR_SHUFFLE(?,?,Mask)) -> VECTOR_SHUFFLE(EXTRACT_SUBVECTOR(?, ?), EXTRACT_SUBVECTOR(?, ?), Mask')

Closed, Public. Authored by lebedev.ri on Jun 11 2021, 3:31 PM.

Summary

In most test changes this allows us to drop some broadcasts/shuffles. I think I got the logic right (at least, I have already caught the obvious bugs I had).
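As a rough sketch of the shape of the fold (my simplified reconstruction, not the committed DAGCombiner.cpp code, which carries additional one-use, legality, and profitability checks): the demanded slice of the wide shuffle mask may reference at most two subvector-sized chunks of the concatenation X ++ Y, and those chunks become the operands of the new narrow shuffle.

    #include "llvm/ADT/SmallVector.h"
    #include "llvm/CodeGen/SelectionDAG.h"
    #include "llvm/CodeGen/SelectionDAGNodes.h"
    using namespace llvm;

    // Simplified reconstruction of the fold (not the committed code):
    //   EXTRACT_SUBVECTOR(VECTOR_SHUFFLE(X, Y, Mask), Idx)
    //     -> VECTOR_SHUFFLE(EXTRACT_SUBVECTOR(?, ?), EXTRACT_SUBVECTOR(?, ?), Mask')
    static SDValue narrowExtractOfShuffle(SDNode *N, SelectionDAG &DAG) {
      assert(N->getOpcode() == ISD::EXTRACT_SUBVECTOR);
      auto *Shuf = dyn_cast<ShuffleVectorSDNode>(N->getOperand(0));
      if (!Shuf || !Shuf->hasOneUse())
        return SDValue();

      EVT NarrowVT = N->getValueType(0);
      EVT WideVT = Shuf->getValueType(0);
      unsigned NumNarrowElts = NarrowVT.getVectorNumElements();
      unsigned NumWideElts = WideVT.getVectorNumElements();
      unsigned FirstElt = N->getConstantOperandVal(1);
      if (NumWideElts % NumNarrowElts != 0)
        return SDValue();

      SmallVector<int, 16> NewMask;
      int Chunks[2] = {-1, -1}; // chunk indexes into the concatenation X ++ Y
      for (unsigned I = 0; I != NumNarrowElts; ++I) {
        int M = Shuf->getMaskElt(FirstElt + I);
        if (M < 0) { // undef lane stays undef
          NewMask.push_back(-1);
          continue;
        }
        int Chunk = M / NumNarrowElts;
        // Find an operand slot already holding this chunk, or a free one.
        unsigned Op = 0;
        while (Op != 2 && Chunks[Op] != Chunk && Chunks[Op] != -1)
          ++Op;
        if (Op == 2) // would need more than two source chunks; give up
          return SDValue();
        Chunks[Op] = Chunk;
        NewMask.push_back(Op * NumNarrowElts + M % NumNarrowElts);
      }

      SDLoc DL(N);
      auto GetChunk = [&](int Chunk) {
        if (Chunk < 0)
          return DAG.getUNDEF(NarrowVT);
        unsigned Start = unsigned(Chunk) * NumNarrowElts;
        SDValue WideOp = Shuf->getOperand(Start < NumWideElts ? 0 : 1);
        return DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, NarrowVT, WideOp,
                           DAG.getVectorIdxConstant(Start % NumWideElts, DL));
      };
      return DAG.getVectorShuffle(NarrowVT, DL, GetChunk(Chunks[0]),
                                  GetChunk(Chunks[1]), NewMask);
    }

The mask rewrite is the whole trick: each surviving lane keeps its offset within its chunk, and the chunk itself is renumbered to operand 0 or 1 of the new shuffle.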
Event Timeline

Jun 11 2021, 3:31 PM: Herald added subscribers: ecnelises, danielkiss, pengfei and 2 others.

lebedev.ri: Aha, looks like this is salvageable; adding even more restrictions seems to have gotten rid of the obvious regressions.
lebedev.ri added inline comments; lebedev.ri marked an inline comment as done.

lebedev.ri: Rebasing now that D104250 landed.

lebedev.ri: Many remaining cases are rotates, with a pattern like:

  Combining: t0: ch = EntryToken
  Optimized vector-legalized selection DAG: %bb.0 'splatvar_funnnel_v8i32:'
  SelectionDAG has 26 nodes:
    t0: ch = EntryToken
    t2: v8i32,ch = CopyFromReg t0, Register:v8i32 %0
    t25: v2i64 = zero_extend_vector_inreg t45
    t26: v4i32 = bitcast t25
    t27: v8i32 = X86ISD::VSHL t2, t26
    t40: v4i32 = BUILD_VECTOR Constant:i32<32>, Constant:i32<32>, Constant:i32<32>, Constant:i32<32>
    t38: v4i32 = sub t40, t45
    t30: v2i64 = zero_extend_vector_inreg t38
    t31: v4i32 = bitcast t30
    t32: v8i32 = X86ISD::VSRL t2, t31
    t21: v8i32 = or t27, t32
    t10: ch,glue = CopyToReg t0, Register:v8i32 $ymm0, t21
    t4: v8i32,ch = CopyFromReg t0, Register:v8i32 %1
    t6: v8i32 = vector_shuffle<0,0,0,0,0,0,0,0> t4, undef:v8i32
    t43: v4i32 = extract_subvector t6, Constant:i64<0>
    t47: v4i32 = BUILD_VECTOR Constant:i32<31>, Constant:i32<31>, Constant:i32<31>, Constant:i32<31>
    t45: v4i32 = and t43, t47
    t11: ch = X86ISD::RET_FLAG t10, TargetConstant:i32<0>, Register:v8i32 $ymm0, t10:1

Let's suppose we start at t27: v8i32 = X86ISD::VSHL t2, t26, and demand the 0th element of the shift amount t26.
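To spell out the chain being walked (my reading of the dump above, not text from the review): lane 0 of t26 is the low half of lane 0 of t25, which zero-extends lane 0 of t45 = and(t43, 31); t43 is the low half of the splat t6, so the demanded amount is really just lane 0 of %1. The lane rescaling across the v2i64 -> v4i32 bitcast works roughly as in this illustrative helper (not actual LLVM code):

    #include "llvm/ADT/APInt.h"
    using namespace llvm;

    // Illustrative only: scale a demanded-lanes mask from a narrow-element
    // bitcast result back to its wide-element source, the way demanding
    // lane 0 of t26 (v4i32) becomes demanding lane 0 of t25 (v2i64).
    static APInt scaleDemandedEltsToSrc(const APInt &DstDemanded,
                                        unsigned NumSrcElts) {
      unsigned NumDstElts = DstDemanded.getBitWidth();
      unsigned Scale = NumDstElts / NumSrcElts; // narrow lanes per wide lane
      APInt SrcDemanded = APInt::getZero(NumSrcElts);
      for (unsigned I = 0; I != NumDstElts; ++I)
        if (DstDemanded[I])
          SrcDemanded.setBit(I / Scale);
      return SrcDemanded;
    }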
lebedev.ri: Looks like SimplifyMultipleUseDemandedBits() could theoretically deal with it.
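For reference, a hedged sketch of what such a query could look like for the shift-amount operand; the helper and its placement are my illustration, though SimplifyMultipleUseDemandedBits() itself is the real TargetLowering API:

    #include "llvm/CodeGen/SelectionDAG.h"
    #include "llvm/CodeGen/TargetLowering.h"
    using namespace llvm;

    // Illustrative helper (not actual LLVM code): ask for an existing,
    // simpler node computing the same value in lane 0 of a vector shift
    // amount (t26 above), without creating new nodes or touching the
    // amount's other uses.
    static SDValue simplifyAmtLane0(const TargetLowering &TLI,
                                    SelectionDAG &DAG, SDValue Amt) {
      EVT VT = Amt.getValueType();
      APInt DemandedElts = APInt::getOneBitSet(VT.getVectorNumElements(), 0);
      APInt DemandedBits = APInt::getAllOnes(VT.getScalarSizeInBits());
      return TLI.SimplifyMultipleUseDemandedBits(Amt, DemandedBits,
                                                 DemandedElts, DAG,
                                                 /*Depth=*/0);
    }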
lebedev.ri: Hm, and perhaps the biggest issue is: how would we skip over the element-count-changing extract_subvector?
RKSimon: What if the ZERO_EXTEND_VECTOR_INREG case is extended to just bitcast the source when we only need the 0th element and the upper source elements aliasing that 0th element are known to be zero?
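A hedged sketch of that suggestion as I understand it: for t25: v2i64 = zero_extend_vector_inreg t45, lane 0 of bitcast(t45) is (t45[1] << 32) | t45[0], which equals the zero-extension iff t45[1] is known zero. The helper name and placement are mine:

    #include "llvm/CodeGen/SelectionDAG.h"
    #include "llvm/Support/KnownBits.h"
    using namespace llvm;

    // Sketch of the suggested fold (my illustration, not committed code):
    // if only lane 0 of a ZERO_EXTEND_VECTOR_INREG is demanded, and the
    // source lanes aliasing the high bits of that lane are known zero, the
    // node is just a bitcast of the source. Assumes little-endian lane
    // order, as on x86.
    static SDValue zextInRegToBitcast(SDNode *N, SelectionDAG &DAG,
                                      const APInt &DemandedElts) {
      assert(N->getOpcode() == ISD::ZERO_EXTEND_VECTOR_INREG);
      SDValue Src = N->getOperand(0);
      EVT VT = N->getValueType(0);
      EVT SrcVT = Src.getValueType();
      if (VT.getSizeInBits() != SrcVT.getSizeInBits() ||
          DemandedElts != APInt::getOneBitSet(VT.getVectorNumElements(), 0))
        return SDValue();
      unsigned Scale = VT.getScalarSizeInBits() / SrcVT.getScalarSizeInBits();
      // Source lanes 1..Scale-1 alias the high bits of destination lane 0.
      APInt UpperLanes =
          APInt::getBitsSet(SrcVT.getVectorNumElements(), 1, Scale);
      if (!DAG.computeKnownBits(Src, UpperLanes).isZero())
        return SDValue();
      return DAG.getBitcast(VT, Src);
    }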
lebedev.ri: But in that snippet they are clearly not zeros, since it's a splat vector?

lebedev.ri: @RKSimon ping. Any further thoughts/hints on how this might be achievable via demanded bits, etc.?

RKSimon: Sorry for not getting back to this - I've been meaning to take a look at the x86 test changes - the fact that they are mostly from the same funnel-shift/rotation lowering code suggests we've just missed something for splat values in there.

RKSimon: I still think getting to the bottom of why we don't already fold the splatted rotation/funnel-shift amount would be more useful.

lebedev.ri: Rebased.
lebedev.ri: rGb95f66ad786b8f2814d4ef4373e8ac3902e6f62a is cheating :)

lebedev.ri: @RKSimon do you have any concrete arguments against this? I'm seeing this pattern (well, roughly) when looking at https://bugs.llvm.org/show_bug.cgi?id=52337:

  Optimized type-legalized selection DAG: %bb.0 'mask_i32_stride8_vf4:'
  SelectionDAG has 56 nodes:
    t0: ch = EntryToken
    t6: i64,ch = CopyFromReg t0, Register:i64 %2
    t294: v16i8 = vector_shuffle<0,u,0,u,0,u,0,u,0,u,0,u,0,u,0,u> t282, undef:v16i8
    t255: v8i16 = bitcast t294
    t236: v8i32 = any_extend t255
    t261: v8i32 = shl t236, t259
    t194: v8i32,ch = masked_load<(load (s256) from %ir.ptr, align 4)> t0, t6, undef:i64, t261, undef:v8i32
    t29: ch,glue = CopyToReg t0, Register:v8i32 $ymm0, t194
    t195: i64 = add t6, Constant:i64<32>
    t292: v16i8 = vector_shuffle<4,u,4,u,4,u,4,u,4,u,4,u,4,u,4,u> t282, undef:v16i8
    t257: v8i16 = bitcast t292
    t216: v8i32 = any_extend t257
    t260: v8i32 = shl t216, t259
    t196: v8i32,ch = masked_load<(load (s256) from %ir.ptr + 32, align 4)> t0, t195, undef:i64, t260, undef:v8i32
    t31: ch,glue = CopyToReg t29, Register:v8i32 $ymm1, t196, t29:1
    t62: i64 = add t6, Constant:i64<64>
    t268: v8i16 = any_extend_vector_inreg t264
    t173: v8i32 = any_extend t268
    t281: v8i32 = shl t173, t259
    t129: v8i32,ch = masked_load<(load (s256) from %ir.ptr + 64, align 4)> t0, t62, undef:i64, t281, undef:v8i32
    t33: ch,glue = CopyToReg t31, Register:v8i32 $ymm2, t129, t31:1
    t280: i64 = add t6, Constant:i64<96>
    t275: v16i8 = vector_shuffle<8,u,9,u,10,u,11,u,12,u,13,u,14,u,15,u> t264, undef:v16i8
    t272: v8i16 = bitcast t275
    t152: v8i32 = any_extend t272
    t278: v8i32 = shl t152, t259
    t132: v8i32,ch = masked_load<(load (s256) from %ir.ptr + 96, align 4)> t0, t280, undef:i64, t278, undef:v8i32
    t35: ch,glue = CopyToReg t33, Register:v8i32 $ymm3, t132, t33:1
    t259: v8i32 = BUILD_VECTOR Constant:i32<31>, Constant:i32<31>, Constant:i32<31>, Constant:i32<31>, Constant:i32<31>, Constant:i32<31>, Constant:i32<31>, Constant:i32<31>
    t287: v32i8 = concat_vectors t282, undef:v16i8
    t289: v32i8 = vector_shuffle<u,u,u,u,u,u,u,u,u,u,u,u,u,u,u,u,8,8,8,8,8,8,8,8,12,12,12,12,12,12,12,12> t287, undef:v16i8
    t264: v16i8 = extract_subvector t289, Constant:i64<16>
    t2: i64,ch = CopyFromReg t0, Register:i64 %0
    t9: v4i32,ch = load<(load (s128) from %ir.in.vec, align 32)> t0, t2, undef:i64
    t11: v4i32 = BUILD_VECTOR Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>
    t42: v4i32 = setcc t9, t11, setlt:ch
    t282: v16i8 = bitcast t42
    t36: ch = X86ISD::RET_FLAG t35, TargetConstant:i32<0>, Register:v8i32 $ymm0, Register:v8i32 $ymm1, Register:v8i32 $ymm2, Register:v8i32 $ymm3, t35:1

i.e.

  t287: v32i8 = concat_vectors t282, undef:v16i8
  t289: v32i8 = vector_shuffle<u,u,u,u,u,u,u,u,u,u,u,u,u,u,u,u,8,8,8,8,8,8,8,8,12,12,12,12,12,12,12,12> t287, undef:v32i8
  t264: v16i8 = extract_subvector t289, Constant:i64<16>
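If the combine proposed here fires on that tail (my worked reading of the mask, not output from the patch): lanes 16..31 of t289 read only elements 8 and 12 of t287, and both lie in the low v16i8 half (t282), so the concat_vectors/extract_subvector pair folds away, leaving roughly:

  t264: v16i8 = vector_shuffle<8,8,8,8,8,8,8,8,12,12,12,12,12,12,12,12> t282, undef:v16i8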
lebedev.ri: Uhm, oops, let me upload the right diff.

lebedev.ri: Actually upload the diff with the change, not just the test changes.

RKSimon: Do we have any test coverage with any subvectors that are not half the width of the source? 512-bit -> 128-bit AVX512 shuffles, for instance.
lebedev.ri marked 3 inline comments as done.

lebedev.ri: @RKSimon thank you for taking a look!
lebedev.ri: I've checked, and it doesn't appear so.
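For illustration, such a case would look something like this (hypothetical nodes, not from the current test suite), where the extracted v4i32 is only a quarter of the v16i32 source:

  t1: v16i32 = vector_shuffle<0,16,1,17,2,18,3,19,4,20,5,21,6,22,7,23> tA, tB
  t2: v4i32 = extract_subvector t1, Constant:i64<4>

The demanded mask slice is <2,18,3,19>, pulling from the first v4i32 chunk of tA and the first of tB, so the narrowed form would be vector_shuffle<2,6,3,7> over two index-0 v4i32 extracts.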
lebedev.ri: ping
lebedev.ri marked an inline comment as done.

lebedev.ri: Forgo the undef constant folding; rebased.

lebedev.ri marked 2 inline comments as done.

lebedev.ri: Relaxed the profitability check - I'm not sure this is better, and that is why it was there.
lebedev.ri marked an inline comment as done.

lebedev.ri: And back to the restrictive profitability check.
This revision is now accepted and ready to land. (Dec 13 2021, 4:32 AM)
lebedev.ri: Hmm. Thank you for the review.
RKSimon: I'm mainly interested in the rotation splat fixes, as this removes a headache for some ongoing work on better vector rotation/funnel-shift codegen.

This revision was landed with ongoing or failed builds. (Dec 13 2021, 9:06 AM)
Closed by commit rGc1a36ba002b8: [DAGCombine][X86][ARM] EXTRACT_SUBVECTOR(VECTOR_SHUFFLE(?,?,Mask)) ->… (authored by lebedev.ri).
This revision was automatically updated to reflect the committed changes.
Revision Contents

Diff 392056:
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
llvm/test/CodeGen/ARM/vext.ll
llvm/test/CodeGen/X86/vector-fshl-rot-256.ll
llvm/test/CodeGen/X86/vector-fshr-rot-256.ll
llvm/test/CodeGen/X86/vector-fshr-rot-512.ll
llvm/test/CodeGen/X86/vector-rotate-256.ll
llvm/test/CodeGen/X86/vector-shift-ashr-256.ll
llvm/test/CodeGen/X86/vector-shift-lshr-256.ll
llvm/test/CodeGen/X86/vector-shift-shl-256.ll
llvm/test/CodeGen/X86/vector-trunc.ll
"The old shuffle needs to go away."