Differential D104156
[DAGCombine][X86][ARM] EXTRACT_SUBVECTOR(VECTOR_SHUFFLE(?,?,Mask)) -> VECTOR_SHUFFLE(EXTRACT_SUBVECTOR(?, ?), EXTRACT_SUBVECTOR(?, ?), Mask')
Closed, Public. Authored by lebedev.ri on Jun 11 2021, 3:31 PM.
Details

Summary
In most test changes this allows us to drop some broadcasts/shuffles. I think I got the logic right (at least, I have already caught the obvious bugs I had...)
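For reference, a minimal sketch of the fold, restricted to half-width extracts at an element boundary. The helper name and the "at most two narrow source halves" restriction are my illustration, not necessarily what the committed code does:

// Sketch: extract_subvector(vector_shuffle(X, Y, Mask), ExtIdx)
//   --> vector_shuffle(extract_subvector(A), extract_subvector(B), Mask')
// where A/B are narrow halves of X/Y. Helper name is hypothetical.
static SDValue narrowExtractOfShuffle(SDNode *Extract, SelectionDAG &DAG) {
  SDValue Shuf = Extract->getOperand(0);
  if (Shuf.getOpcode() != ISD::VECTOR_SHUFFLE || !Shuf.hasOneUse())
    return SDValue();

  EVT NarrowVT = Extract->getValueType(0);
  unsigned NumNarrowElts = NarrowVT.getVectorNumElements();
  // Only handle half-width extracts at an element boundary, so each wide
  // shuffle operand splits cleanly into two narrow halves.
  if (Shuf.getValueType().getVectorNumElements() != 2 * NumNarrowElts)
    return SDValue();
  uint64_t ExtIdx = Extract->getConstantOperandVal(1);
  if (ExtIdx % NumNarrowElts != 0)
    return SDValue();

  // A wide mask element M indexes one of four narrow halves:
  // 0,1 = halves of operand 0; 2,3 = halves of operand 1.
  ArrayRef<int> Mask = cast<ShuffleVectorSDNode>(Shuf)->getMask();
  int Halves[2] = {-1, -1}; // give up if more than two halves are demanded
  SmallVector<int, 16> NewMask;
  for (unsigned I = 0; I != NumNarrowElts; ++I) {
    int M = Mask[ExtIdx + I];
    if (M < 0) {
      NewMask.push_back(-1);
      continue;
    }
    int Half = M / (int)NumNarrowElts, Op;
    if (Halves[0] == Half || Halves[0] < 0)
      Halves[Op = 0] = Half;
    else if (Halves[1] == Half || Halves[1] < 0)
      Halves[Op = 1] = Half;
    else
      return SDValue();
    NewMask.push_back(Op * (int)NumNarrowElts + M % (int)NumNarrowElts);
  }

  SDLoc DL(Extract);
  auto GetHalf = [&](int Half) -> SDValue {
    if (Half < 0)
      return DAG.getUNDEF(NarrowVT);
    return DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, NarrowVT,
                       Shuf.getOperand(Half / 2),
                       DAG.getVectorIdxConstant((Half % 2) * NumNarrowElts, DL));
  };
  return DAG.getVectorShuffle(NarrowVT, DL, GetHalf(Halves[0]),
                              GetHalf(Halves[1]), NewMask);
}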
Diff Detail
Event Timeline

Herald added subscribers: ecnelises, danielkiss, pengfei and 2 others. Jun 11 2021, 3:31 PM

Aha, looks like this is salvageable; adding even more restrictions seems to have gotten rid of the obvious regressions.
lebedev.ri added inline comments. lebedev.ri marked an inline comment as done.

Rebasing now that D104250 landed.

Many remaining cases are rotates, with a pattern like:
Optimized vector-legalized selection DAG: %bb.0 'splatvar_funnnel_v8i32:'
SelectionDAG has 26 nodes:
t0: ch = EntryToken
t2: v8i32,ch = CopyFromReg t0, Register:v8i32 %0
t25: v2i64 = zero_extend_vector_inreg t45
t26: v4i32 = bitcast t25
t27: v8i32 = X86ISD::VSHL t2, t26
t40: v4i32 = BUILD_VECTOR Constant:i32<32>, Constant:i32<32>, Constant:i32<32>, Constant:i32<32>
t38: v4i32 = sub t40, t45
t30: v2i64 = zero_extend_vector_inreg t38
t31: v4i32 = bitcast t30
t32: v8i32 = X86ISD::VSRL t2, t31
t21: v8i32 = or t27, t32
t10: ch,glue = CopyToReg t0, Register:v8i32 $ymm0, t21
t4: v8i32,ch = CopyFromReg t0, Register:v8i32 %1
t6: v8i32 = vector_shuffle<0,0,0,0,0,0,0,0> t4, undef:v8i32
t43: v4i32 = extract_subvector t6, Constant:i64<0>
t47: v4i32 = BUILD_VECTOR Constant:i32<31>, Constant:i32<31>, Constant:i32<31>, Constant:i32<31>
t45: v4i32 = and t43, t47
t11: ch = X86ISD::RET_FLAG t10, TargetConstant:i32<0>, Register:v8i32 $ymm0, t10:1

Let's suppose we start at t27: v8i32 = X86ISD::VSHL t2, t26, and demand the 0th element of the shift amount t26.
Looks like SimplifyMultipleUseDemandedBits() could theoretically deal with it.
Hm, and perhaps the biggest issue is: how would we skip over an element-count-changing extract_subvector?
What if the ZERO_EXTEND_VECTOR_INREG case is extended to just bitcast the source if we only need the 0th element and the upper source elements aliasing that 0th element are known to be zero?
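A rough standalone sketch of that suggestion, assuming little-endian lane aliasing. The helper name is hypothetical; in-tree this would presumably live inside SimplifyMultipleUseDemandedBits()'s ZERO_EXTEND_VECTOR_INREG handling:

// Hypothetical helper: if only lane 0 of a ZERO_EXTEND_VECTOR_INREG is
// demanded, and the narrow source lanes that would alias the rest of lane 0
// after a plain bitcast are already known zero, the zext-in-reg is a no-op
// and can become a bitcast of the source. Assumes little-endian lane order.
static SDValue skipZExtInRegForLane0(SDValue Op, SelectionDAG &DAG) {
  assert(Op.getOpcode() == ISD::ZERO_EXTEND_VECTOR_INREG);
  SDValue Src = Op.getOperand(0);
  EVT VT = Op.getValueType();
  EVT SrcVT = Src.getValueType();
  // One wide result lane covers Scale narrow source lanes.
  unsigned Scale = VT.getScalarSizeInBits() / SrcVT.getScalarSizeInBits();
  // Source lanes 1..Scale-1 alias the high bits of result lane 0 after a
  // bitcast, so they must be known zero for the bitcast to be equivalent.
  APInt HiLanes = APInt::getBitsSet(SrcVT.getVectorNumElements(), 1, Scale);
  KnownBits Known = DAG.computeKnownBits(Src, HiLanes);
  if (!Known.isZero())
    return SDValue();
  return DAG.getBitcast(VT, Src);
}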
But in that snippet they are clearly not zeros, since it's a splat vector?

@RKSimon ping. Any further thoughts/hints on how this might be achievable via demanded bits/etc.?

Sorry for not getting back to this - I've been meaning to take a look at the x86 test changes. The fact that they are mostly from the same funnel-shift/rotation lowering code suggests we've just missed something for splat values in there.

I still think getting to the bottom of why we don't already fold the splatted rotation/funnel-shift amount would be more useful.

Rebased.
rGb95f66ad786b8f2814d4ef4373e8ac3902e6f62a is cheating :)

@RKSimon do you have any concrete arguments against this? I'm seeing this pattern (well, roughly) when looking at https://bugs.llvm.org/show_bug.cgi?id=52337:

Optimized type-legalized selection DAG: %bb.0 'mask_i32_stride8_vf4:'
SelectionDAG has 56 nodes:
t0: ch = EntryToken
t6: i64,ch = CopyFromReg t0, Register:i64 %2
t294: v16i8 = vector_shuffle<0,u,0,u,0,u,0,u,0,u,0,u,0,u,0,u> t282, undef:v16i8
t255: v8i16 = bitcast t294
t236: v8i32 = any_extend t255
t261: v8i32 = shl t236, t259
t194: v8i32,ch = masked_load<(load (s256) from %ir.ptr, align 4)> t0, t6, undef:i64, t261, undef:v8i32
t29: ch,glue = CopyToReg t0, Register:v8i32 $ymm0, t194
t195: i64 = add t6, Constant:i64<32>
t292: v16i8 = vector_shuffle<4,u,4,u,4,u,4,u,4,u,4,u,4,u,4,u> t282, undef:v16i8
t257: v8i16 = bitcast t292
t216: v8i32 = any_extend t257
t260: v8i32 = shl t216, t259
t196: v8i32,ch = masked_load<(load (s256) from %ir.ptr + 32, align 4)> t0, t195, undef:i64, t260, undef:v8i32
t31: ch,glue = CopyToReg t29, Register:v8i32 $ymm1, t196, t29:1
t62: i64 = add t6, Constant:i64<64>
t268: v8i16 = any_extend_vector_inreg t264
t173: v8i32 = any_extend t268
t281: v8i32 = shl t173, t259
t129: v8i32,ch = masked_load<(load (s256) from %ir.ptr + 64, align 4)> t0, t62, undef:i64, t281, undef:v8i32
t33: ch,glue = CopyToReg t31, Register:v8i32 $ymm2, t129, t31:1
t280: i64 = add t6, Constant:i64<96>
t275: v16i8 = vector_shuffle<8,u,9,u,10,u,11,u,12,u,13,u,14,u,15,u> t264, undef:v16i8
t272: v8i16 = bitcast t275
t152: v8i32 = any_extend t272
t278: v8i32 = shl t152, t259
t132: v8i32,ch = masked_load<(load (s256) from %ir.ptr + 96, align 4)> t0, t280, undef:i64, t278, undef:v8i32
t35: ch,glue = CopyToReg t33, Register:v8i32 $ymm3, t132, t33:1
t259: v8i32 = BUILD_VECTOR Constant:i32<31>, Constant:i32<31>, Constant:i32<31>, Constant:i32<31>, Constant:i32<31>, Constant:i32<31>, Constant:i32<31>, Constant:i32<31>
t287: v32i8 = concat_vectors t282, undef:v16i8
t289: v32i8 = vector_shuffle<u,u,u,u,u,u,u,u,u,u,u,u,u,u,u,u,8,8,8,8,8,8,8,8,12,12,12,12,12,12,12,12> t287, undef:v32i8
t264: v16i8 = extract_subvector t289, Constant:i64<16>
t2: i64,ch = CopyFromReg t0, Register:i64 %0
t9: v4i32,ch = load<(load (s128) from %ir.in.vec, align 32)> t0, t2, undef:i64
t11: v4i32 = BUILD_VECTOR Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>
t42: v4i32 = setcc t9, t11, setlt:ch
t282: v16i8 = bitcast t42
t36: ch = X86ISD::RET_FLAG t35, TargetConstant:i32<0>, Register:v8i32 $ymm0, Register:v8i32 $ymm1, Register:v8i32 $ymm2, Register:v8i32 $ymm3, t35:1

i.e.

t287: v32i8 = concat_vectors t282, undef:v16i8
t289: v32i8 = vector_shuffle<u,u,u,u,u,u,u,u,u,u,u,u,u,u,u,u,8,8,8,8,8,8,8,8,12,12,12,12,12,12,12,12> t287, undef:v32i8
t264: v16i8 = extract_subvector t289, Constant:i64<16>
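Hand-applying this patch's combine to that tail (my derivation, not compiler output): the extract demands wide mask lanes 16-31, i.e. <8,8,8,8,8,8,8,8,12,12,12,12,12,12,12,12>, all of which read the low half of t287, and the low half of t287 is just t282. So the concat and the wide shuffle should collapse to a single narrow shuffle:

t264: v16i8 = vector_shuffle<8,8,8,8,8,8,8,8,12,12,12,12,12,12,12,12> t282, undef:v16i8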
Uhm, oops, let me upload the right diff.

Actually upload the diff with the change, not just the test changes.

Do we have any test coverage with subvectors that are not half the width of the source? 512-bit -> 128-bit AVX512 shuffles, for instance.
lebedev.ri marked 3 inline comments as done.

@RKSimon thank you for taking a look!
I've checked and it doesn't appear so.
ping
lebedev.ri marked an inline comment as done.

Forgo the undef constant folding; rebased.

lebedev.ri marked 2 inline comments as done.

Relaxed the profitability check - I'm not sure this is better, and that is why it was there.
lebedev.ri marked an inline comment as done.

And back to the restrictive profitability check.
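For context, a restrictive guard of the kind being toggled here could look roughly like this; a sketch reusing names from the earlier sketch and assuming the existing TargetLowering hooks named below, not necessarily the exact committed condition:

// Only narrow when the wide shuffle dies, the subvector extracts are cheap
// for the target, and the narrowed mask is something the target can select.
bool Profitable = Shuf.hasOneUse() &&
                  TLI.isExtractSubvectorCheap(NarrowVT, WideVT, ExtIdx) &&
                  TLI.isShuffleMaskLegal(NewMask, NarrowVT);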
This revision is now accepted and ready to land. Dec 13 2021, 4:32 AM
Hmm. Thank you for the review.
I'm mainly interested in the rotation splat fixes, as this removes a headache for some ongoing work on better vector rotation/funnel-shift codegen.

This revision was landed with ongoing or failed builds. Dec 13 2021, 9:06 AM
Closed by commit rGc1a36ba002b8: [DAGCombine][X86][ARM] EXTRACT_SUBVECTOR(VECTOR_SHUFFLE(?,?,Mask)) ->… (authored by lebedev.ri).
This revision was automatically updated to reflect the committed changes.
Revision Contents
Diff 393914 llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
llvm/test/CodeGen/ARM/vext.ll
llvm/test/CodeGen/X86/vector-fshl-rot-256.ll
llvm/test/CodeGen/X86/vector-fshr-rot-256.ll
llvm/test/CodeGen/X86/vector-fshr-rot-512.ll
llvm/test/CodeGen/X86/vector-rotate-256.ll
llvm/test/CodeGen/X86/vector-shift-ashr-256.ll
llvm/test/CodeGen/X86/vector-shift-lshr-256.ll
llvm/test/CodeGen/X86/vector-shift-shl-256.ll
llvm/test/CodeGen/X86/vector-trunc.ll
"The old shuffle needs to go away."