
[DAGCombine][X86][ARM] EXTRACT_SUBVECTOR(VECTOR_SHUFFLE(?,?,Mask)) -> VECTOR_SHUFFLE(EXTRACT_SUBVECTOR(?, ?), EXTRACT_SUBVECTOR(?, ?), Mask')
Needs Review · Public

Authored by lebedev.ri on Jun 11 2021, 3:31 PM.

Details

Summary

In most test changes this allows us to drop some broadcasts/shuffles.

I think I got the logic right (at least, I have already caught the obvious bugs I had...).
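As a scalar sketch of the fold named in the title (a hand-rolled Python model for illustration only, not the DAGCombiner code; all helper names are made up): extracting a subvector of a shuffle becomes a narrower shuffle of extracted subvectors with a rebased mask, which this toy model only allows when every demanded mask entry stays inside the matching idx-aligned slice of its operand.

```python
# Toy model: vectors are Python lists; a mask entry m selects lhs[m] if
# m < len(lhs), rhs[m - len(lhs)] otherwise; -1 means undef (-> None).

def shuffle(lhs, rhs, mask):
    n = len(lhs)
    return [None if m < 0 else (lhs[m] if m < n else rhs[m - n]) for m in mask]

def extract_subvector(vec, idx, num):
    return vec[idx:idx + num]

def fold(lhs, rhs, mask, idx, num):
    # extract_subvector(shuffle(lhs, rhs, mask), idx, num)
    #   -> shuffle(extract(lhs, idx, num), extract(rhs, idx, num), mask')
    # Restriction (one of several a real combine would need): every
    # demanded mask entry must point into the same [idx, idx+num) slice
    # of whichever operand it selects.
    n = len(lhs)
    new_mask = []
    for m in mask[idx:idx + num]:
        if m < 0:
            new_mask.append(-1)
            continue
        src_elt = m if m < n else m - n
        assert idx <= src_elt < idx + num, "fold not applicable"
        new_mask.append(src_elt - idx if m < n else src_elt - idx + num)
    return shuffle(extract_subvector(lhs, idx, num),
                   extract_subvector(rhs, idx, num), new_mask)
```

For example, for a splat shuffle (all-zero mask) with the low half extracted, the fold yields an all-zero-mask shuffle of the low subvectors.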

Diff Detail

Event Timeline

lebedev.ri created this revision.Jun 11 2021, 3:31 PM
lebedev.ri requested review of this revision.Jun 11 2021, 3:31 PM
lebedev.ri edited the summary of this revision. (Show Details)

Aha, looks like this is salvageable; adding even more restrictions seems to have gotten rid of the obvious regressions.

Add more context to the diff, NFC.

RKSimon added inline comments.Jun 14 2021, 5:59 AM
llvm/test/CodeGen/X86/min-legal-vector-width.ll
1693

A lot of these changes just look like we're missing something in SimplifyDemandedVectorElts tbh

lebedev.ri marked an inline comment as done.Jun 14 2021, 11:25 AM
lebedev.ri added inline comments.
llvm/test/CodeGen/X86/min-legal-vector-width.ll
1693

Aha, D104250.

Matt added a subscriber: Matt.Jun 14 2021, 12:40 PM
lebedev.ri marked an inline comment as done.

Rebasing now that D104250 landed.
Some changes remain, and some of them still look like they're missing SimplifyDemandedVectorElts() folds.

Many remaining cases are rotates, with the pattern like:

Combining: t0: ch = EntryToken
Optimized vector-legalized selection DAG: %bb.0 'splatvar_funnnel_v8i32:'
SelectionDAG has 26 nodes:
  t0: ch = EntryToken
  t2: v8i32,ch = CopyFromReg t0, Register:v8i32 %0
          t25: v2i64 = zero_extend_vector_inreg t45
        t26: v4i32 = bitcast t25
      t27: v8i32 = X86ISD::VSHL t2, t26
              t40: v4i32 = BUILD_VECTOR Constant:i32<32>, Constant:i32<32>, Constant:i32<32>, Constant:i32<32>
            t38: v4i32 = sub t40, t45
          t30: v2i64 = zero_extend_vector_inreg t38
        t31: v4i32 = bitcast t30
      t32: v8i32 = X86ISD::VSRL t2, t31
    t21: v8i32 = or t27, t32
  t10: ch,glue = CopyToReg t0, Register:v8i32 $ymm0, t21
        t4: v8i32,ch = CopyFromReg t0, Register:v8i32 %1
      t6: v8i32 = vector_shuffle<0,0,0,0,0,0,0,0> t4, undef:v8i32
    t43: v4i32 = extract_subvector t6, Constant:i64<0>
    t47: v4i32 = BUILD_VECTOR Constant:i32<31>, Constant:i32<31>, Constant:i32<31>, Constant:i32<31>
  t45: v4i32 = and t43, t47
  t11: ch = X86ISD::RET_FLAG t10, TargetConstant:i32<0>, Register:v8i32 $ymm0, t10:1

Let's suppose we start at t27: v8i32 = X86ISD::VSHL t2, t26, and demand the 0th element of the shift amount t26.
I think the problem is that t45 has another use, t38, for the other shift; likewise if we start from the other shift.
I'm not sure whether this can be solved within the existing demanded-elts infrastructure?
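In scalar form, the rotate pattern from that dump looks roughly like the following Python model (the node ids in the comments refer to the dump above; this only illustrates the use structure, it is not LLVM code). The point is that the masked amount (t45) feeds both shifts, so simplifying the demanded elements through either shift alone still leaves the other use demanding all of t45:

```python
MASK32 = 0xFFFFFFFF

def splat_rotate_v8i32(x, raw_amt):
    amt = raw_amt & 31                       # t45 = and t43, t47  (two uses!)
    shl = [(e << amt) & MASK32 for e in x]   # t27 = VSHL t2, zext(t45)
    srl = [e >> (32 - amt) for e in x]       # t32 = VSRL t2, zext(sub(32, t45))
    return [a | b for a, b in zip(shl, srl)] # t21 = or t27, t32
```

(At amt == 0 the right shift is by 32, which yields 0 both in this model and for x86's VSRL, so the identity case still works out.)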

Looks like SimplifyMultipleUseDemandedBits() could theoretically deal with it,
but then it would need to learn not only about scalar_to_vector-of-extract_vector_elt:
the and and vector_shuffle nodes would also need to be handled, so it can recurse through SimplifyMultipleUseDemandedBits().

Hm, and perhaps the biggest issue is: how would we skip over an element-count-changing extract_subvector?
So I'm not sure the rest can be dealt with by SimplifyMultipleUseDemandedBits().

What if the ZERO_EXTEND_VECTOR_INREG case is extended to just bitcast the source if we only need the 0th element and the upper source elements aliasing that 0th element is known to be zero?
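As a little-endian scalar model of that suggestion (illustration only; helper names are made up): element 0 of the zero_extend_vector_inreg (t25 above) and element 0 of a plain v2i64 bitcast of the same v4i32 source agree exactly when source element 1, which aliases the high half of the bitcast's element 0, is zero.

```python
def zext_inreg_elt0(v4i32):
    # element 0 of zero_extend_vector_inreg v4i32 -> v2i64: the low i32,
    # zero-extended to i64
    return v4i32[0]

def bitcast_elt0(v4i32):
    # element 0 of the v2i64 bitcast of the same source (little-endian):
    # source element 1 lands in the high 32 bits
    return v4i32[0] | (v4i32[1] << 32)
```

For a splatted shift amount, element 1 equals element 0 and is generally nonzero, so the two disagree there.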

But in that snippet they are clearly not zeros, since it's a splat vector?

@RKSimon ping. Any further thoughts/hints on how this might be achievable via demanded bits/elts etc.?

Sorry for not getting back to this; I've been meaning to take a look at the x86 test changes. The fact that they mostly come from the same funnel-shift/rotation lowering code suggests we've just missed something for splat values in there.

I still think getting to the bottom of why we don't already fold the splatted rotation/funnel shift amount would be more useful.

Rebased

rGb95f66ad786b8f2814d4ef4373e8ac3902e6f62a is cheating :)
I was sure the question was *not* about fixing the code being emitted (i.e. the lowering),
but about fixing the already-emitted code.