Differential D104156
[DAGCombine][X86][ARM] EXTRACT_SUBVECTOR(VECTOR_SHUFFLE(?,?,Mask)) -> VECTOR_SHUFFLE(EXTRACT_SUBVECTOR(?, ?), EXTRACT_SUBVECTOR(?, ?), Mask')
Authored by lebedev.ri on Jun 11 2021, 3:31 PM.
Details
In most test changes this allows us to drop some broadcasts/shuffles. I think I got the logic right (at least, I have already caught the obvious bugs I had).
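As a hedged illustration of the mask logic (not the actual DAGCombiner implementation; the function name and representation below are invented for this sketch), the fold can be modeled in Python: extracting a subvector of a shuffle is rewritable as a shuffle of two extracted subvectors whenever the non-undef entries of the extracted mask window touch at most two subvector-aligned chunks of the original operands.

```python
def fold_extract_of_shuffle(mask, num_elts, sub_elts, extract_idx):
    """Toy model of:
      extract_subvector(vector_shuffle(A, B, mask), extract_idx)
        -> vector_shuffle(extract_subvector(?, i), extract_subvector(?, j), mask')
    mask entries: -1 is undef, 0..num_elts-1 index A, num_elts..2*num_elts-1 index B.
    Returns (ops, new_mask), where ops lists the (source, chunk) pairs that
    become the new EXTRACT_SUBVECTOR operands, or None if the fold does not apply."""
    window = mask[extract_idx : extract_idx + sub_elts]
    ops = []        # at most two subvector-aligned chunks may be referenced
    new_mask = []
    for m in window:
        if m < 0:
            new_mask.append(-1)
            continue
        src, elt = divmod(m, num_elts)
        chunk = (src, elt // sub_elts)
        if chunk not in ops:
            if len(ops) == 2:
                return None  # would need a third extracted subvector
            ops.append(chunk)
        new_mask.append(ops.index(chunk) * sub_elts + elt % sub_elts)
    return ops, new_mask
```

For example, extracting the low v4 of an 8-element shuffle whose low half reads entirely from operand B yields a single extract of B's low chunk with an identity mask, while a mask whose window references three aligned chunks correctly rejects the fold.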
Comment Actions
Aha, looks like this is salvageable; adding even more restrictions seems to have gotten rid of obvious regressions.
Comment Actions
Rebasing now that D104250 landed.

Comment Actions
Many remaining cases are rotates, with a pattern like:

  Combining: t0: ch = EntryToken
  Optimized vector-legalized selection DAG: %bb.0 'splatvar_funnnel_v8i32:'
  SelectionDAG has 26 nodes:
    t0: ch = EntryToken
    t2: v8i32,ch = CopyFromReg t0, Register:v8i32 %0
    t25: v2i64 = zero_extend_vector_inreg t45
    t26: v4i32 = bitcast t25
    t27: v8i32 = X86ISD::VSHL t2, t26
    t40: v4i32 = BUILD_VECTOR Constant:i32<32>, Constant:i32<32>, Constant:i32<32>, Constant:i32<32>
    t38: v4i32 = sub t40, t45
    t30: v2i64 = zero_extend_vector_inreg t38
    t31: v4i32 = bitcast t30
    t32: v8i32 = X86ISD::VSRL t2, t31
    t21: v8i32 = or t27, t32
    t10: ch,glue = CopyToReg t0, Register:v8i32 $ymm0, t21
    t4: v8i32,ch = CopyFromReg t0, Register:v8i32 %1
    t6: v8i32 = vector_shuffle<0,0,0,0,0,0,0,0> t4, undef:v8i32
    t43: v4i32 = extract_subvector t6, Constant:i64<0>
    t47: v4i32 = BUILD_VECTOR Constant:i32<31>, Constant:i32<31>, Constant:i32<31>, Constant:i32<31>
    t45: v4i32 = and t43, t47
    t11: ch = X86ISD::RET_FLAG t10, TargetConstant:i32<0>, Register:v8i32 $ymm0, t10:1

Let's suppose we start at t27: v8i32 = X86ISD::VSHL t2, t26, and demand the 0'th element of shift amount t26.

Comment Actions
Looks like SimplifyMultipleUseDemandedBits() could theoretically deal with it.

Comment Actions
Hm, and perhaps the biggest issue is: how would we skip over an element-count-changing extract_subvector?

Comment Actions
What if the ZERO_EXTEND_VECTOR_INREG case is extended to just bitcast the source if we only need the 0th element and the upper source elements aliasing that 0th element are known to be zero?

Comment Actions
@RKSimon ping. Any further thoughts/hints on how this might be achievable via demanded bits/etc.?

Comment Actions
Sorry for not getting back to this - I've been meaning to take a look at the x86 test changes - the fact that they are mostly from the same funnel-shift/rotation lowering code suggests we've just missed something for splat values in there.
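A toy model of that ZERO_EXTEND_VECTOR_INREG suggestion (the function names are invented for this sketch; this is not LLVM code): for v4i32 -> v2i64, element 0 of zero_extend_vector_inreg and of a plain bitcast agree exactly when source element 1 (the lane aliasing the upper 32 bits of result element 0) is known to be zero, so if only element 0 is demanded and that lane is provably zero, the extension can be replaced by a bitcast.

```python
def zext_inreg_v4i32_to_v2i64(src):
    # zero_extend_vector_inreg: zero-extend the low two i32 lanes to i64
    return [src[0], src[1]]

def bitcast_v4i32_to_v2i64(src):
    # bitcast: pairs of i32 lanes reinterpreted as i64 (little-endian)
    return [src[0] | (src[1] << 32), src[2] | (src[3] << 32)]
```

With src[1] == 0, element 0 of both results is identical even though the upper elements differ, which is exactly the "only the 0th element is demanded" case.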
Comment Actions
I still think getting to the bottom of why we don't already fold the splatted rotation/funnel shift amount would be more useful.

Comment Actions
Rebased. rGb95f66ad786b8f2814d4ef4373e8ac3902e6f62a is cheating :)

Comment Actions
@RKSimon do you have any concrete arguments against this? I'm seeing this pattern (well, roughly) when looking at https://bugs.llvm.org/show_bug.cgi?id=52337:

  Optimized type-legalized selection DAG: %bb.0 'mask_i32_stride8_vf4:'
  SelectionDAG has 56 nodes:
    t0: ch = EntryToken
    t6: i64,ch = CopyFromReg t0, Register:i64 %2
    t294: v16i8 = vector_shuffle<0,u,0,u,0,u,0,u,0,u,0,u,0,u,0,u> t282, undef:v16i8
    t255: v8i16 = bitcast t294
    t236: v8i32 = any_extend t255
    t261: v8i32 = shl t236, t259
    t194: v8i32,ch = masked_load<(load (s256) from %ir.ptr, align 4)> t0, t6, undef:i64, t261, undef:v8i32
    t29: ch,glue = CopyToReg t0, Register:v8i32 $ymm0, t194
    t195: i64 = add t6, Constant:i64<32>
    t292: v16i8 = vector_shuffle<4,u,4,u,4,u,4,u,4,u,4,u,4,u,4,u> t282, undef:v16i8
    t257: v8i16 = bitcast t292
    t216: v8i32 = any_extend t257
    t260: v8i32 = shl t216, t259
    t196: v8i32,ch = masked_load<(load (s256) from %ir.ptr + 32, align 4)> t0, t195, undef:i64, t260, undef:v8i32
    t31: ch,glue = CopyToReg t29, Register:v8i32 $ymm1, t196, t29:1
    t62: i64 = add t6, Constant:i64<64>
    t268: v8i16 = any_extend_vector_inreg t264
    t173: v8i32 = any_extend t268
    t281: v8i32 = shl t173, t259
    t129: v8i32,ch = masked_load<(load (s256) from %ir.ptr + 64, align 4)> t0, t62, undef:i64, t281, undef:v8i32
    t33: ch,glue = CopyToReg t31, Register:v8i32 $ymm2, t129, t31:1
    t280: i64 = add t6, Constant:i64<96>
    t275: v16i8 = vector_shuffle<8,u,9,u,10,u,11,u,12,u,13,u,14,u,15,u> t264, undef:v16i8
    t272: v8i16 = bitcast t275
    t152: v8i32 = any_extend t272
    t278: v8i32 = shl t152, t259
    t132: v8i32,ch = masked_load<(load (s256) from %ir.ptr + 96, align 4)> t0, t280, undef:i64, t278, undef:v8i32
    t35: ch,glue = CopyToReg t33, Register:v8i32 $ymm3, t132, t33:1
    t259: v8i32 = BUILD_VECTOR Constant:i32<31>, Constant:i32<31>, Constant:i32<31>, Constant:i32<31>, Constant:i32<31>, Constant:i32<31>, Constant:i32<31>, Constant:i32<31>
    t287: v32i8 = concat_vectors t282, undef:v16i8
    t289: v32i8 = vector_shuffle<u,u,u,u,u,u,u,u,u,u,u,u,u,u,u,u,8,8,8,8,8,8,8,8,12,12,12,12,12,12,12,12> t287, undef:v32i8
    t264: v16i8 = extract_subvector t289, Constant:i64<16>
    t2: i64,ch = CopyFromReg t0, Register:i64 %0
    t9: v4i32,ch = load<(load (s128) from %ir.in.vec, align 32)> t0, t2, undef:i64
    t11: v4i32 = BUILD_VECTOR Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>
    t42: v4i32 = setcc t9, t11, setlt:ch
    t282: v16i8 = bitcast t42
    t36: ch = X86ISD::RET_FLAG t35, TargetConstant:i32<0>, Register:v8i32 $ymm0, Register:v8i32 $ymm1, Register:v8i32 $ymm2, Register:v8i32 $ymm3, t35:1

i.e.

    t287: v32i8 = concat_vectors t282, undef:v16i8
    t289: v32i8 = vector_shuffle<u,u,u,u,u,u,u,u,u,u,u,u,u,u,u,u,8,8,8,8,8,8,8,8,12,12,12,12,12,12,12,12> t287, undef:v32i8
    t264: v16i8 = extract_subvector t289, Constant:i64<16>

Comment Actions
Actually upload the diff with the change, not just the test changes.

Comment Actions
Do we have any test coverage with subvectors that are not half the width of the source? 512-bit -> 128-bit AVX512 shuffles, for instance.
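Roughly, every demanded lane of that wide shuffle mask indexes into t282 itself (entries 8 and 12 are below 16, i.e. inside the concat's low half), so the concat + shuffle + extract chain reduces to a single narrow shuffle of t282. A hedged Python model (lane values stand in for vector elements; the function name is invented for this sketch):

```python
def extract_high_of_shuffled_concat(x, mask):
    """Toy model of:
      wide = concat_vectors(x, undef)                 # 2*len(x) lanes
      shuf = vector_shuffle(wide, undef, mask)
      res  = extract_subvector(shuf, len(x))          # high half
    When every demanded (high-half) mask entry indexes into x, the chain
    is just a len(x)-lane shuffle of x; undef lanes are kept as None."""
    n = len(x)
    high = mask[n:]
    assert all(m < 0 or m < n for m in high), "demanded lanes must come from x"
    return [None if m < 0 else x[m] for m in high]
```

With the mask from the dump above (u lanes in the low half, then eight 8s and eight 12s), the result is simply lanes 8 and 12 of t282 broadcast into each half of a 16-lane vector, with no wide concat or extract needed.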
Comment Actions
@RKSimon thank you for taking a look! I've checked, and it doesn't appear so.
Comment Actions
ping
Comment Actions
Relaxed the profitability check - I'm not sure this is better, and that is why it was there.
Comment Actions
Hmm. Thank you for the review.

Comment Actions
I'm mainly interested in the rotation splat fixes, as this removes a headache for some ongoing work on better vector rotation/funnel shift codegen.
"The old shuffle needs to go away."