This is an archive of the discontinued LLVM Phabricator instance.

[DAGCombine][X86][ARM] EXTRACT_SUBVECTOR(VECTOR_SHUFFLE(?,?,Mask)) -> VECTOR_SHUFFLE(EXTRACT_SUBVECTOR(?, ?), EXTRACT_SUBVECTOR(?, ?), Mask')
ClosedPublic

Authored by lebedev.ri on Jun 11 2021, 3:31 PM.

Details

Summary

In most test changes this allows us to drop some broadcasts/shuffles.

I think I got the logic right (at least, I have already caught the obvious bugs I had...)
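
For reference, the transform being proposed is roughly the following (a minimal sketch of the idea only, not the actual patch; names such as NarrowVT, NumEltsExtracted and DemandedSubvectors are illustrative):

// Sketch: N is EXTRACT_SUBVECTOR(VECTOR_SHUFFLE(A, B, Mask), Idx).
auto *WideShuffleVector = cast<ShuffleVectorSDNode>(N->getOperand(0));
EVT NarrowVT = N->getValueType(0);
EVT WideVT = WideShuffleVector->getValueType(0);
unsigned WideNumElts = WideVT.getVectorNumElements();
unsigned NumEltsExtracted = NarrowVT.getVectorNumElements();
unsigned FirstExtractedElt = N->getConstantOperandVal(1);

// Only the NumEltsExtracted-wide slice of the wide mask is actually read.
ArrayRef<int> WideMask = WideShuffleVector->getMask();
ArrayRef<int> SubMask = WideMask.slice(FirstExtractedElt, NumEltsExtracted);

// For each mask element, record which NarrowVT-sized subvector of which
// shuffle operand it reads, and rebase the index into that subvector.
SmallVector<int, 16> NewMask;
SmallSetVector<std::pair<SDValue, int>, 2> DemandedSubvectors;
for (int M : SubMask) {
  if (M < 0) {
    NewMask.push_back(-1); // Undef lane stays undef.
    continue;
  }
  SDValue Op = WideShuffleVector->getOperand(M < (int)WideNumElts ? 0 : 1);
  int OpEltIdx = M % WideNumElts;
  int SubVecIdx = OpEltIdx / NumEltsExtracted;
  DemandedSubvectors.insert({Op, SubVecIdx});
  int NewOpIdx = std::distance(
      DemandedSubvectors.begin(),
      llvm::find(DemandedSubvectors, std::make_pair(Op, SubVecIdx)));
  NewMask.push_back(NewOpIdx * NumEltsExtracted + OpEltIdx % NumEltsExtracted);
}
// Give up unless at most two subvectors end up demanded; otherwise emit one
// EXTRACT_SUBVECTOR per demanded subvector and a NarrowVT VECTOR_SHUFFLE of
// those, using NewMask.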

Diff Detail

Event Timeline

lebedev.ri created this revision.Jun 11 2021, 3:31 PM
lebedev.ri requested review of this revision.Jun 11 2021, 3:31 PM
lebedev.ri edited the summary of this revision. (Show Details)

Aha, looks like this is salvageable: adding even more restrictions seems to have gotten rid of the obvious regressions.

Add more context to the diff, NFC.

RKSimon added inline comments.Jun 14 2021, 5:59 AM
llvm/test/CodeGen/X86/min-legal-vector-width.ll
1693 ↗(On Diff #351652)

A lot of these changes just look like we're missing something in SimplifyDemandedVectorElts tbh

lebedev.ri marked an inline comment as done.Jun 14 2021, 11:25 AM
lebedev.ri added inline comments.
llvm/test/CodeGen/X86/min-legal-vector-width.ll
1693 ↗(On Diff #351652)

Aha, D104250.

Matt added a subscriber: Matt.Jun 14 2021, 12:40 PM
lebedev.ri marked an inline comment as done.

Rebasing now that D104250 landed.
Some changes remain, and some of them still look like they're missing SimplifyDemandedVectorElts() folds.

Many remaining cases are rotates, with a pattern like:

Combining: t0: ch = EntryToken
Optimized vector-legalized selection DAG: %bb.0 'splatvar_funnnel_v8i32:'
SelectionDAG has 26 nodes:
  t0: ch = EntryToken
  t2: v8i32,ch = CopyFromReg t0, Register:v8i32 %0
          t25: v2i64 = zero_extend_vector_inreg t45
        t26: v4i32 = bitcast t25
      t27: v8i32 = X86ISD::VSHL t2, t26
              t40: v4i32 = BUILD_VECTOR Constant:i32<32>, Constant:i32<32>, Constant:i32<32>, Constant:i32<32>
            t38: v4i32 = sub t40, t45
          t30: v2i64 = zero_extend_vector_inreg t38
        t31: v4i32 = bitcast t30
      t32: v8i32 = X86ISD::VSRL t2, t31
    t21: v8i32 = or t27, t32
  t10: ch,glue = CopyToReg t0, Register:v8i32 $ymm0, t21
        t4: v8i32,ch = CopyFromReg t0, Register:v8i32 %1
      t6: v8i32 = vector_shuffle<0,0,0,0,0,0,0,0> t4, undef:v8i32
    t43: v4i32 = extract_subvector t6, Constant:i64<0>
    t47: v4i32 = BUILD_VECTOR Constant:i32<31>, Constant:i32<31>, Constant:i32<31>, Constant:i32<31>
  t45: v4i32 = and t43, t47
  t11: ch = X86ISD::RET_FLAG t10, TargetConstant:i32<0>, Register:v8i32 $ymm0, t10:1

Let's suppose we start at t27: v8i32 = X86ISD::VSHL t2, t26, and demand the 0th element of the shift amount t26.
I think the problem is that t45 has another use, t38, which feeds the other shift.
Likewise if we start from the other shift.
I'm not sure this can be solved within the existing demanded-elts infra?

Looks like SimplifyMultipleUseDemandedBits() could theoretically deal with it,
but then it would need to learn not only about scalar_to_vector-of-extract_vector_elt;
the and and vector_shuffle nodes would also need to be handled to recurse through SimplifyMultipleUseDemandedBits().
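
To illustrate what "recursing through" would mean here: the splat shuffle part is easy to peek through when only element 0 is demanded; it's the and that needs actual demanded-bits reasoning. A purely hypothetical sketch (peekThroughSplatAmount is made up here, not an existing API):

// Hypothetical helper: if only element 0 of V is demanded, a splat shuffle of
// element 0 can be looked through directly. The `and t43, t47` step would
// additionally need demanded-bits reasoning, which is exactly what
// SimplifyMultipleUseDemandedBits() is supposed to provide.
static SDValue peekThroughSplatAmount(SDValue V) {
  if (auto *Shuf = dyn_cast<ShuffleVectorSDNode>(V))
    if (Shuf->isSplat() && Shuf->getSplatIndex() == 0 &&
        Shuf->getOperand(1).isUndef())
      return Shuf->getOperand(0); // Element 0 comes straight from here.
  return SDValue();
}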

Hm, and perhaps the biggest issue is: how would we skip over an element-count-changing extract_subvector?
So I'm not sure the rest can be dealt with by SimplifyMultipleUseDemandedBits().

So I'm not sure the rest can be dealt with by SimplifyMultipleUseDemandedBits().

What if the ZERO_EXTEND_VECTOR_INREG case is extended to just bitcast the source if we only need the 0th element and the upper source elements aliasing that 0th element is known to be zero?
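
For the v2i64 = zero_extend_vector_inreg (v4i32 src) case above, that suggestion would amount to something like this (sketch only, written against a SimplifyDemandedVectorElts-style context; DemandedElts refers to the demanded elements of Op):

// Sketch: Op = ZERO_EXTEND_VECTOR_INREG(Src), and only element 0 is demanded.
// Element 0 of Op is zext(Src[0]); on a little-endian target like x86,
// element 0 of bitcast(Src) is (Src[1] << 32) | Src[0], so the two agree iff
// Src[1] (the aliasing upper source element) is known to be zero.
SDValue Src = Op.getOperand(0);
unsigned NumSrcElts = Src.getValueType().getVectorNumElements();
unsigned Scale =
    Op.getScalarValueSizeInBits() / Src.getScalarValueSizeInBits();
if (DemandedElts == 1) {
  APInt UpperAliasedElts = APInt::getBitsSet(NumSrcElts, 1, Scale);
  if (DAG.computeKnownBits(Src, UpperAliasedElts).isZero())
    return DAG.getBitcast(Op.getValueType(), Src);
}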

So I'm not sure the rest can be dealt with by SimplifyMultipleUseDemandedBits().

What if the ZERO_EXTEND_VECTOR_INREG case is extended to just bitcast the source if we only need the 0th element and the upper source elements aliasing that 0th element is known to be zero?

But in that snippet they are clearly not zeros, since it's a splat vector?

@RKSimon ping. Any further thoughts/hints on how this might be achievable via demandedbits/etc?

Sorry for not getting back to this - I've been meaning to take a look at the x86 test changes. The fact that they mostly come from the same funnel-shift/rotation lowering code suggests we've just missed something for splat values in there.

I still think getting to the bottom of why we don't already fold the splatted rotation/funnel shift amount would be more useful.

Rebased

I still think getting to the bottom of why we don't already fold the splatted rotation/funnel shift amount would be more useful.

rGb95f66ad786b8f2814d4ef4373e8ac3902e6f62a is cheating :)
I was sure the question was *not* about changing the lowering so this code isn't emitted in the first place,
but about fixing the already-emitted code.

@RKSimon do you have any concrete arguments against this?

I'm seeing this pattern (well, roughly) when looking at https://bugs.llvm.org/show_bug.cgi?id=52337:

Optimized type-legalized selection DAG: %bb.0 'mask_i32_stride8_vf4:'
SelectionDAG has 56 nodes:
  t0: ch = EntryToken
  t6: i64,ch = CopyFromReg t0, Register:i64 %2
            t294: v16i8 = vector_shuffle<0,u,0,u,0,u,0,u,0,u,0,u,0,u,0,u> t282, undef:v16i8
          t255: v8i16 = bitcast t294
        t236: v8i32 = any_extend t255
      t261: v8i32 = shl t236, t259
    t194: v8i32,ch = masked_load<(load (s256) from %ir.ptr, align 4)> t0, t6, undef:i64, t261, undef:v8i32
  t29: ch,glue = CopyToReg t0, Register:v8i32 $ymm0, t194
      t195: i64 = add t6, Constant:i64<32>
            t292: v16i8 = vector_shuffle<4,u,4,u,4,u,4,u,4,u,4,u,4,u,4,u> t282, undef:v16i8
          t257: v8i16 = bitcast t292
        t216: v8i32 = any_extend t257
      t260: v8i32 = shl t216, t259
    t196: v8i32,ch = masked_load<(load (s256) from %ir.ptr + 32, align 4)> t0, t195, undef:i64, t260, undef:v8i32
  t31: ch,glue = CopyToReg t29, Register:v8i32 $ymm1, t196, t29:1
      t62: i64 = add t6, Constant:i64<64>
          t268: v8i16 = any_extend_vector_inreg t264
        t173: v8i32 = any_extend t268
      t281: v8i32 = shl t173, t259
    t129: v8i32,ch = masked_load<(load (s256) from %ir.ptr + 64, align 4)> t0, t62, undef:i64, t281, undef:v8i32
  t33: ch,glue = CopyToReg t31, Register:v8i32 $ymm2, t129, t31:1
      t280: i64 = add t6, Constant:i64<96>
            t275: v16i8 = vector_shuffle<8,u,9,u,10,u,11,u,12,u,13,u,14,u,15,u> t264, undef:v16i8
          t272: v8i16 = bitcast t275
        t152: v8i32 = any_extend t272
      t278: v8i32 = shl t152, t259
    t132: v8i32,ch = masked_load<(load (s256) from %ir.ptr + 96, align 4)> t0, t280, undef:i64, t278, undef:v8i32
  t35: ch,glue = CopyToReg t33, Register:v8i32 $ymm3, t132, t33:1
  t259: v8i32 = BUILD_VECTOR Constant:i32<31>, Constant:i32<31>, Constant:i32<31>, Constant:i32<31>, Constant:i32<31>, Constant:i32<31>, Constant:i32<31>, Constant:i32<31>
      t287: v32i8 = concat_vectors t282, undef:v16i8
    t289: v32i8 = vector_shuffle<u,u,u,u,u,u,u,u,u,u,u,u,u,u,u,u,8,8,8,8,8,8,8,8,12,12,12,12,12,12,12,12> t287, undef:v32i8
  t264: v16i8 = extract_subvector t289, Constant:i64<16>
        t2: i64,ch = CopyFromReg t0, Register:i64 %0
      t9: v4i32,ch = load<(load (s128) from %ir.in.vec, align 32)> t0, t2, undef:i64
      t11: v4i32 = BUILD_VECTOR Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>
    t42: v4i32 = setcc t9, t11, setlt:ch
  t282: v16i8 = bitcast t42
  t36: ch = X86ISD::RET_FLAG t35, TargetConstant:i32<0>, Register:v8i32 $ymm0, Register:v8i32 $ymm1, Register:v8i32 $ymm2, Register:v8i32 $ymm3, t35:1

i.e.

    t287: v32i8 = concat_vectors t282, undef:v16i8
  t289: v32i8 = vector_shuffle<u,u,u,u,u,u,u,u,u,u,u,u,u,u,u,u,8,8,8,8,8,8,8,8,12,12,12,12,12,12,12,12> t287, undef:v32i8
t264: v16i8 = extract_subvector t289, Constant:i64<16>
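
With this fold, that tail should become roughly the following (the upper half of t287 is undef and mask elements 8/12 land in t282, so only one narrow subvector is demanded and the 256-bit concat/shuffle/extract disappears):

t264: v16i8 = vector_shuffle<8,8,8,8,8,8,8,8,12,12,12,12,12,12,12,12> t282, undef:v16i8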

Rebased, NFC.

where has the code change gone?

where has the code change gone?

Uhm, oops, let me upload the right diff.

lebedev.ri updated this revision to Diff 384384.Nov 3 2021, 3:25 AM

Actually upload the diff with the change, not just the test changes.
Sorry for the noise.

Do we have any test coverage with any subvectors that are not half the width of the source? 512-bit -> 128-bit AVX512 shuffles for instance.

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
20620

Merge the 2 asserts?

assert((-1 <= M) && (M < (2 * WideNumElts)) && "Out-of-bounds shuffle mask?");

20651

This is actually pretty AVX specific - for instance NEON can usually reference both 64-bit vectors for free.

20654

make_pair?

20670

There are a lot of assertions in this function :)

20684

Shouldn't this be done earlier, as soon as we know NarrowVT, to early-out?

lebedev.ri updated this revision to Diff 384403.Nov 3 2021, 4:59 AM
lebedev.ri marked 3 inline comments as done.

@RKSimon thank you for taking a look!
Addressed some nits.

Do we have any test coverage with any subvectors that are not half the width of the source? 512-bit -> 128-bit AVX512 shuffles for instance.

I've checked and it doesn't appear so.

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
20651

Right, this needs relaxation. I intend to at least special-case extracting
from a vector concatenation; in certain cases the subvector index is then
also irrelevant.

20670

Generally I feel like there are a lot of invariants in the LLVM codebase that aren't asserted.

20684

Well, yes and no. As can be seen in the comment for the previous code block,
we would then lose the return DAG.getUNDEF(NarrowVT); constant-fold case.
Should we not have it?

RKSimon added inline comments.Nov 3 2021, 5:26 AM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
20684

I suppose it depends if these UNDEF paths are actually active?

lebedev.ri added inline comments.Nov 3 2021, 5:33 AM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
20673–20675

I suppose it depends if these UNDEF paths are actually active?

If I comment out this block, then the "Should end up with either one or two ops" assertion fires at least once:

$ ninja check-llvm-codegen
[ 99% 261/262][ 30% 0:00:21 + 0:00:47] Running lit suite /repositories/llvm-project/llvm/test/CodeGen
FAIL: LLVM :: CodeGen/X86/sad.ll (1064 of 18987)
******************** TEST 'LLVM :: CodeGen/X86/sad.ll' FAILED ********************
Script:
--
: 'RUN: at line 2';   /builddirs/llvm-project/build-Clang13/bin/llc < /repositories/llvm-project/llvm/test/CodeGen/X86/sad.ll -mtriple=x86_64-unknown-unknown -mattr=+sse2 | /builddirs/llvm-project/build-Clang13/bin/FileCheck /repositories/llvm-project/llvm/test/CodeGen/X86/sad.ll --check-prefixes=SSE2
: 'RUN: at line 3';   /builddirs/llvm-project/build-Clang13/bin/llc < /repositories/llvm-project/llvm/test/CodeGen/X86/sad.ll -mtriple=x86_64-unknown-unknown -mattr=+avx  | /builddirs/llvm-project/build-Clang13/bin/FileCheck /repositories/llvm-project/llvm/test/CodeGen/X86/sad.ll --check-prefixes=AVX,AVX1
: 'RUN: at line 4';   /builddirs/llvm-project/build-Clang13/bin/llc < /repositories/llvm-project/llvm/test/CodeGen/X86/sad.ll -mtriple=x86_64-unknown-unknown -mattr=+avx2 | /builddirs/llvm-project/build-Clang13/bin/FileCheck /repositories/llvm-project/llvm/test/CodeGen/X86/sad.ll --check-prefixes=AVX,AVX2
: 'RUN: at line 5';   /builddirs/llvm-project/build-Clang13/bin/llc < /repositories/llvm-project/llvm/test/CodeGen/X86/sad.ll -mtriple=x86_64-unknown-unknown -mattr=+avx512f | /builddirs/llvm-project/build-Clang13/bin/FileCheck /repositories/llvm-project/llvm/test/CodeGen/X86/sad.ll --check-prefixes=AVX,AVX512,AVX512F
: 'RUN: at line 6';   /builddirs/llvm-project/build-Clang13/bin/llc < /repositories/llvm-project/llvm/test/CodeGen/X86/sad.ll -mtriple=x86_64-unknown-unknown -mattr=+avx512bw | /builddirs/llvm-project/build-Clang13/bin/FileCheck /repositories/llvm-project/llvm/test/CodeGen/X86/sad.ll --check-prefixes=AVX,AVX512,AVX512BW
--
Exit Code: 1

Command Output (stderr):
--
llc: /repositories/llvm-project/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp:20710: llvm::SDValue foldExtractSubvectorFromShuffleVector(llvm::SDNode *, llvm::SelectionDAG &, const llvm::TargetLowering &, bool): Assertion `(NewOps.size() == 1 || NewOps.size() == 2) && "Should end up with either one or two ops"' failed.
PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace.
Stack dump:
0.      Program arguments: /builddirs/llvm-project/build-Clang13/bin/llc -mtriple=x86_64-unknown-unknown -mattr=+avx
1.      Running pass 'Function Pass Manager' on module '<stdin>'.
2.      Running pass 'X86 DAG->DAG Instruction Selection' on function '@sad_unroll_nonzero_initial'

So i would say it is active.

Now that the costmodel patch spam has wound down,
rebase & ping.

ping

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
20684

https://reviews.llvm.org/D104156#inline-1078555, but is the comment saying that some generic handling
should catch such cases regardless? I can drop it if that's the case.

RKSimon added inline comments.Dec 6 2021, 6:41 AM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
20704

Do the isExtractSubvectorCheap tests in their own all_of() check before we start creating nodes? It helps avoid one-use failures in later folds if this check fails.
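
Something like the following, i.e. doing the whole profitability scan before building any nodes (sketch; DemandedSubvectors / NumEltsExtracted / NarrowVT are the names assumed from this combine):

// If any demanded (operand, subvector index) pair is not cheap to extract,
// bail out before creating EXTRACT_SUBVECTOR nodes, so a failure here can't
// leave stray one-use nodes that block later folds.
if (!llvm::all_of(DemandedSubvectors,
                  [&](const std::pair<SDValue, int> &Subvector) {
                    unsigned EltIdx = Subvector.second * NumEltsExtracted;
                    return TLI.isExtractSubvectorCheap(
                        NarrowVT, Subvector.first.getValueType(), EltIdx);
                  }))
  return SDValue();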

lebedev.ri updated this revision to Diff 392056.Dec 6 2021, 7:25 AM
lebedev.ri marked an inline comment as done.

Forego the undef constant folding; rebased.

RKSimon added inline comments.Dec 6 2021, 9:58 AM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
20621

WideShuffleVector->getMask().slice() ?

20660

Why do we need this and isExtractSubvectorCheap ?

lebedev.ri marked 2 inline comments as done.

Relaxed the profitability check - I'm not sure this is better, which is why the check was there in the first place.

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
20660

Because the test changes were somewhat less controversial with that check.

lebedev.ri added inline comments.Dec 6 2021, 1:04 PM
llvm/test/CodeGen/X86/vector-fshr-rot-128.ll
402 ↗(On Diff #392122)

The problem is that cheap != free, so we could be turning a cheap and free extraction (from the 0th subvector)
into a cheap but non-free extraction (from a non-0th subvector), or even into two of those.
Or am I wrong and this explicit extraction is actually better?
Because if it's not, we'll need an 'is free subvec extract' hook, I believe?
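
A hypothetical shape for such a hook, just to illustrate the cheap-vs-free distinction (this is made up here, not an existing TargetLowering API):

// Return true if extracting the ResVT subvector starting at element Index
// from SrcVT is completely free (e.g. the low subregister of a wider
// register), rather than merely cheap.
virtual bool isExtractSubvectorFree(EVT ResVT, EVT SrcVT, unsigned Index) const {
  return Index == 0; // Conservative default: only the low subvector is free.
}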

RKSimon added inline comments.Dec 6 2021, 1:15 PM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
20660

OK, let's keep it for now

lebedev.ri updated this revision to Diff 392163.Dec 6 2021, 1:28 PM
lebedev.ri marked an inline comment as done.

And back to restrictive profitability check.

RKSimon added inline comments.Dec 8 2021, 8:07 AM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
20621

slice() ?

lebedev.ri marked an inline comment as done.

@RKSimon sorry for so much back-and-forth.

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
20621

Ugh, sorry, that change got lost when I rolled back to the previous revision without double-checking...

RKSimon added inline comments.Dec 12 2021, 1:00 PM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
20597

"The old shuffle needs to go away."

20614

SmallVector<int> NewMask;

20616

I'm torn whether you need the complexity of a set here - is the code that much worse if you manually handle 2 x SDValue/int variables?

lebedev.ri added inline comments.Dec 12 2021, 1:11 PM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
20616

I'm not sure how this can be done without a set;
we need something set-like for this,
since there could be anywhere from 0 to 2 subvectors.

It could be a linear search over a vector,
but I don't think there is such a container in ADT?
(there's SmartSmallSetVector in clang-tools-extra/clang-tidy/misc/NoRecursionCheck.cpp)
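
FWIW, the set-like bookkeeping I have in mind is roughly this (sketch; std::pair<SDValue, int> gets DenseMapInfo via the SDValue and integer specializations, so SmallSetVector works):

// At most two (operand, subvector index) pairs may end up referenced by the
// narrow mask. SmallSetVector keeps them unique and in insertion order, and
// that order directly gives each pair its operand number in the new shuffle.
SmallSetVector<std::pair<SDValue, int>, 2> DemandedSubvectors;
DemandedSubvectors.insert({Op, SubVecIdx}); // Per mask element, in the loop.
if (DemandedSubvectors.size() > 2)
  return SDValue(); // Would need more than two new shuffle operands - bail.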

RKSimon accepted this revision.Dec 13 2021, 4:32 AM

LGTM - cheers @lebedev.ri !

This revision is now accepted and ready to land.Dec 13 2021, 4:32 AM

LGTM - cheers @lebedev.ri !

Hmm. Thank you for the review.
I will admit, in an alternative reality this comment was going to be "let me just abandon this".

LGTM - cheers @lebedev.ri !

Hmm. Thank you for the review.
I will admit, in an alternative reality this comment was going to be "let me just abandon this".

I'm mainly interested in the rotation splat fixes, as this removes a headache for some ongoing work on better vector rotation/funnel shift codegen.

This revision was landed with ongoing or failed builds.Dec 13 2021, 9:06 AM
This revision was automatically updated to reflect the committed changes.