Page MenuHomePhabricator

[RFC][DAGCombine] Make sure combined nodes are added back to the worklist in topological order.
Needs ReviewPublic

Authored by RKSimon on Jun 6 2022, 7:49 AM.

Details

Summary

While this is probably not in a state where it is ready to merge, I decided to submit this for review as per discussion with @RKSimon .

We want to move the DAGCombiner toward processing the node in topological order. Inf act, implementing such as change in the combiner itself is not a major challenge, and I have patches which can do so for quite some time now.

However, finding a path from were we are to where we want to be proved to be fairly challenging:

  • Changing the order in which the nodes are processed affects the register allocator and instruction scheduling, so we get a ton of parasitic changes in the test suite.
  • The combines, backends and test cases tends to overfit the current traversal order.

To quote @RKSimon:

You're hitting a much more severe version of the issues I encountered trying to improve the SimplifyDemandedBits folds, such as:

  • different / missing canonicalization for some ops vs the InstCombine equivalent
  • some weird orders of pattern matches that leave some folds almost impossible to hit
  • some patterns need rewriting to use value tracking instead of matching specific instructions (usually sext/zext patterns of some kind)
  • a lot of very custom DAG folds, usually for one target instruction pattern, in generic and target-specific combines, that seem to have been hacks from the very beginning (I know I'm guilty of that.....)
  • the introduction of new instrinics abs/min/max etc. has meant that some of the DAG folds haven't been updated to account for them
  • we have tests for IR that DAG shouldn't actually ever meet because the middle end should have canonicalized it away and the DAG won't generate those patterns.
  • undef/poison issues are becoming more common as the middle end handling of poison matures
  • early signs of bitrot in some targets - some targets get little attention, others like aarch64/amdgpu are transitioning away from DAG development to GISel (very very slowly.....)

It is obvious that there needs to be some agreed upons trategy to go through this.

This patch is only part of the change required, but is likely enough to notice the problem and come up with a plan.

Diff Detail

Unit TestsFailed

TimeTest
60,070 msx64 debian > MLIR.Examples/standalone::test.toy
Script: -- : 'RUN: at line 1'; /usr/bin/cmake /var/lib/buildkite-agent/builds/llvm-project/mlir/examples/standalone -G "Ninja" -DCMAKE_CXX_COMPILER=/usr/bin/clang++ -DCMAKE_C_COMPILER=/usr/bin/clang -DLLVM_ENABLE_LIBCXX=OFF -DMLIR_DIR=/var/lib/buildkite-agent/builds/llvm-project/build/lib/cmake/mlir -DLLVM_USE_LINKER=lld -DPython3_EXECUTABLE="/usr/bin/python3.9"

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
fhahn added a comment.Jan 4 2023, 8:51 AM

Thank you very much for continuing to push forward with this @deadalnix! Great to see a lot of progress already (e.g. the AArch64 regressions fixed!). I agree it would be good to fix as many regressions as possible before landing this.

n-omer added a subscriber: n-omer.Jan 9 2023, 8:38 AM
deadalnix updated this revision to Diff 487588.Jan 9 2023, 3:49 PM

Rebase on top of D140993 and a recent version of main,
and regenerate tests accordingly.

kparzysz added inline comments.Jan 10 2023, 9:10 AM
llvm/test/CodeGen/Hexagon/autohvx/isel-vpackew.ll
8

This is fine.

llvm/test/CodeGen/Hexagon/autohvx/mulh.ll
19

This isn't good. I'll take a look.

deadalnix added inline comments.Jan 10 2023, 12:09 PM
llvm/test/CodeGen/Hexagon/autohvx/mulh.ll
19

Thanks.

deadalnix updated this revision to Diff 488002.Jan 10 2023, 2:51 PM

Rebase.

One less regression, two more regressions :P

foad added inline comments.Jan 11 2023, 2:58 AM
llvm/test/CodeGen/AMDGPU/widen-smrd-loads.ll
237

Given this DAG:

SelectionDAG has 34 nodes:
  t0: ch,glue = EntryToken
                t94: i64,ch = CopyFromReg t0, Register:i64 %1
              t10: i64 = AssertAlign<16> t94
            t11: i64 = add nuw t10, Constant:i64<36>
          t91: v2i32,ch = load<(dereferenceable invariant load (s64) from %ir.arg.kernarg.offset, align 4, addrspace 4)> t0, t11, undef:i64
        t92: i64 = bitcast t91
      t99: i32,ch = load<(invariant load (s32) from %ir.arg.load, addrspace 4)> t0, t92, undef:i64
    t101: i32 = and t99, Constant:i32<65535>
  t102: i16 = truncate t101
                t49: i32 = any_extend t102
              t79: i32 = add t49, Constant:i32<12>
            t83: i32 = or t79, Constant:i32<4>
          t98: i16 = truncate t83
        t71: i16 = and t98, Constant:i16<255>
                  t38: i16 = srl t102, Constant:i16<8>
                t46: i32 = any_extend t38
              t81: i32 = add t46, Constant:i32<44>
            t84: i32 = or t81, Constant:i32<3>
          t95: i16 = truncate t84
        t61: i16 = shl t95, Constant:i16<8>
      t62: i16 = or t71, t61
    t34: ch = store<(store (s16) into `ptr addrspace(1) null`, addrspace 1)> t0, t62, Constant:i64<0>, undef:i64
  t28: ch = ENDPGM t34

If the first thing we do is call simplifyDemandedBits on t101, then it will be removed (replaced with t99) since the upper 16 bits are not demanded.

But if the first thing we do is combine t49 then it will be replaced with t101 (since any_extend of a truncate is a no-op), losing the fact that we didn't care about the upper 16 bits, and then t101 can no longer be removed.

deadalnix added inline comments.Jan 11 2023, 7:01 AM
llvm/test/CodeGen/AMDGPU/widen-smrd-loads.ll
237

Funny enough, this is exactly why we want to move the DAG to be fully processed in topological order. But as we enforce topological processing on some part of the DAG, it is always possible that in other parts where it is implementation defined, it stops being done in that order. Eventually, all of it will be done in this order.

On that one specifically, t49 eventually sink to t98: i16 = truncate t83 so simplifyDemandedBits still has the information it needs in the DAG. Have you looked at why it isn't finding it? Is it because it's simply too deep, and going that deep would be too costly?

On a side note, because order is implementation defined, there are likely instance of similar patterns in the wild where this doesn't get optimized.

kparzysz added inline comments.Jan 11 2023, 10:22 AM
llvm/test/CodeGen/Hexagon/autohvx/mulh.ll
19

This will be easy to fix. Please ping me once this patch is merged, and I'll take care of it. For the purpose of this patch, this is not an issue.

foad added inline comments.Jan 12 2023, 3:49 AM
llvm/test/CodeGen/AMDGPU/widen-smrd-loads.ll
237

If I understand correctly:

  1. we visit t49 and combine it into t101
  2. we visit t79 and do nothing
  3. because that did nothing, we do not revisit t83 and t98
  4. if we did revisit t98 I think demanded bits would allow us to remove the AND in t101
deadalnix added inline comments.Jan 13 2023, 12:35 AM
llvm/test/CodeGen/AMDGPU/widen-smrd-loads.ll
237

I think so, unless it is too deep.

foad added inline comments.Jan 13 2023, 2:09 AM
llvm/test/CodeGen/AMDGPU/widen-smrd-loads.ll
237

Is this a known deficiency in the DAGCombiner algorithm? I guess when we modify a node we add its immediate users to the worklist, but not all of its users-of-users and so on. Are there ways to work around this?

deadalnix added inline comments.Jan 13 2023, 7:20 AM
llvm/test/CodeGen/AMDGPU/widen-smrd-loads.ll
237

Right now, the order in which nodes are processed is implementation defined. This is a limitation of the current DAGCombiner. This patch is part of an effort to make the order topological.

foad added inline comments.Jan 13 2023, 7:42 AM
llvm/test/CodeGen/AMDGPU/widen-smrd-loads.ll
237

I understand that. Maybe I'm not phrasing my question very well.

Here's my understanding of the current behaviour even with your patch applied. We start off with:

      t49: i32 = any_extend t102
    t79: i32 = add t49, Constant:i32<12>
  t83: i32 = or t79, Constant:i32<4>
t98: i16 = truncate t83

We process t49, and we are able to simplify it, so we add its user t79 to the worklist. Then we process t79 but are not able to simplify it, so we do not add t83 (or t98) to the worklist.

But there is a problem here: the fact that we simplify t49 means that we could now make some progress if we tried to combine t98, because that would call SimplifyDemandedBits which can peek several levels "down" into the DAG, and simplify t102 or t101.

So there's a missed opportunity here. The fact that we simplified t49 unlocks a potential further simplification three levels "up" in the DAG, but we only ever add immediate users (one level "up") to the worklist when we simplify something.

deadalnix added inline comments.Jan 13 2023, 2:02 PM
llvm/test/CodeGen/AMDGPU/widen-smrd-loads.ll
237

I'm not sure where we are talking past each other so let me step back and explain how DAGCombine works and what we want to change about it.

First, the DAGCombiner adds all the nodes in the DAG in its worklist, in an implementation defined order. The actual order depends on the specific operation performed on the DAG before it reaches the combiner. In your example, that would include both t49 and t98.

Then it goes over all nodes and try to combine them. When a node is combined, it readds the result of the combine to the worklist as well as its users. In this case, the combiner visits t98 at some point, then t49, which it combines into t101, but then, it never revisit t98.

Visiting the DAG in topological order, you'd combine t49 before going over t98. This diff does part of the job, but clearly isn't the end of the story. However, it's one of the most disruptive step.

llvm/test/CodeGen/Hexagon/autohvx/mulh.ll
19

OK, thanks for looking into this.

deadalnix added inline comments.Jan 14 2023, 6:38 PM
llvm/test/CodeGen/X86/bitreverse.ll
18 ↗(On Diff #488002)

So this turns out to be cause by non topological processing, which is kind of ironic. We end up having (and (srl (and 65535 x) 1) 21845) which SimplifyDemandedBits should be able to take care of, except we already visited the outer and by the time we generate the inner and, so SimplifyDemandedBits never has a chance to do its job.

Is there a technique to sort this out? This is very similar to the problem @foad investigated.

deadalnix updated this revision to Diff 489859.Jan 17 2023, 9:35 AM

rebase on top of D141884

deadalnix updated this revision to Diff 490678.Jan 19 2023, 3:37 PM

Rebase, bitreverse.ll is definitively fixed.

deadalnix updated this revision to Diff 491218.Jan 22 2023, 6:30 PM

Rebase on top of @RKSimon 's element shuffle work.

deadalnix added inline comments.Jan 22 2023, 6:36 PM
llvm/test/CodeGen/X86/v8i1-masks.ll
207 ↗(On Diff #491218)

We have a new regression here :(

RKSimon added inline comments.Jan 23 2023, 2:03 AM
llvm/test/CodeGen/X86/v8i1-masks.ll
207 ↗(On Diff #491218)

Same regression - I just refactored the file recently to add AVX512 coverage

RKSimon added inline comments.Jan 23 2023, 9:28 AM
llvm/test/CodeGen/X86/any_extend_vector_inreg_of_broadcast_from_memory.ll
3671

We still need to look at this - vbroadcastf128 {{.*#+}} ymm1 = mem[0,1,0,1] and vmovdqa (%rdi), %xmm2 have been split but should be able to share the same load.

RKSimon added inline comments.Jan 23 2023, 9:29 AM
llvm/test/CodeGen/X86/insertelement-var-index.ll
2338

we're inserting the same load that we've already broadcast to the entire zmm?

deadalnix updated this revision to Diff 491668.Jan 24 2023, 1:46 AM

Rebase on top of the fix for v8i1-masks.ll

It might be interesting to make the previous order optional with a flag to llc for testing purposes, just to check which transforms are flaky/dependent on happenstance codes.

It might be interesting to make the previous order optional with a flag to llc for testing purposes, just to check which transforms are flaky/dependent on happenstance codes.

This would be useful for the initial stages when this is committed to trunk as I'm guessing we'll be fighting regressions for a while (which is why I reckon getting this committed shortly after the 16.0 cherry picks quieten down would be ideal) - hopefully it'd never get to a release branch though, or used in any test files.

deadalnix updated this revision to Diff 493290.Jan 30 2023, 5:41 AM

One more rebase and upgradign tests:

Matt added a subscriber: Matt.Feb 1 2023, 8:33 PM
deadalnix updated this revision to Diff 495668.Feb 7 2023, 3:49 PM

remove erroneous autogen note.

chfast added inline comments.Feb 8 2023, 12:19 AM
llvm/test/CodeGen/X86/avx512vl-vec-masked-cmp.ll
2663

Looks like fixed now.

deadalnix added inline comments.Feb 14 2023, 1:42 PM
llvm/test/CodeGen/X86/pr53419.ll
132

This file has now regressed :'(

RKSimon added a subscriber: kazu.Thu, Mar 16, 5:11 AM

Looks like we're very close to finally getting this in - @kazu @goldstein.w.n do you recognize any of the remaining regressions?

RKSimon commandeered this revision.Fri, Mar 17, 4:39 AM
RKSimon edited reviewers, added: deadalnix; removed: RKSimon.

(temporarily commandeering to rebase patch) @deadalnix please take this back when you're about

RKSimon added inline comments.Fri, Mar 17, 5:02 AM
llvm/test/CodeGen/X86/add-and-not.ll
305
SelectionDAG has 14 nodes:
  t0: ch,glue = EntryToken
  t2: i64,ch = CopyFromReg t0, Register:i64 %0
            t15: i32 = truncate t2
          t16: i32 = xor t15, Constant:i32<-1>
        t19: i32 = and t16, Constant:i32<1>
      t20: i64 = zero_extend t19
    t6: i64 = add t2, t20
  t9: ch,glue = CopyToReg t0, Register:i64 $rax, t6
  t10: ch = X86ISD::RET_FLAG t9, TargetConstant:i32<0>, Register:i64 $rax, t9:1
llvm/test/CodeGen/X86/dagcombine-select.ll
217 ↗(On Diff #506040)

Looks like the truncate means we're now failing to call foldBinOpIntoSelect before:

SelectionDAG has 15 nodes:
  t0: ch,glue = EntryToken
  t2: i32,ch = CopyFromReg t0, Register:i32 %0
  t3: i8 = truncate t2
          t4: i1 = truncate t2
        t7: i32 = select t4, Constant:i32<2>, Constant:i32<3>
      t9: i8 = truncate t7
    t10: i32 = shl Constant:i32<1>, t9
  t13: ch,glue = CopyToReg t0, Register:i32 $eax, t10
  t14: ch = X86ISD::RET_FLAG t13, TargetConstant:i32<0>, Register:i32 $eax, t13:1

becomes:

SelectionDAG has 14 nodes:
  t0: ch,glue = EntryToken
            t2: i32,ch = CopyFromReg t0, Register:i32 %0
          t23: i8 = truncate t2
        t25: i8 = and t23, Constant:i8<1>
      t22: i8 = xor t25, Constant:i8<3>
    t10: i32 = shl Constant:i32<1>, t22
  t13: ch,glue = CopyToReg t0, Register:i32 $eax, t10
  t14: ch = X86ISD::RET_FLAG t13, TargetConstant:i32<0>, Register:i32 $eax, t13:1
RKSimon updated this revision to Diff 506561.Mon, Mar 20, 6:33 AM

address regressions in foldBinOpIntoSelect when handling shift(x, trunc/zext(y)) patterns

shift amount type canonicalization was preventing foldBinOpIntoSelect combines before other select-of-constant combines occurred

Unfortunately I haven't found a good way to test the fix separately from the main patch.