This is an archive of the discontinued LLVM Phabricator instance.

[DAGCombine] Make sure combined nodes are added back to the worklist in topological order.
ClosedPublic

Authored by deadalnix on Jun 6 2022, 7:49 AM.

Details

Summary

Currently, a node and its users are added back to the worklist in reverse topological order after it is combined. This diff changes that order to be topological. This is part of a larger migration to get the DAGCombiner to process nodes in topological order.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
deadalnix updated this revision to Diff 491218.Jan 22 2023, 6:30 PM

Rebase on top of @RKSimon 's element shuffle work.

deadalnix added inline comments.Jan 22 2023, 6:36 PM
llvm/test/CodeGen/X86/v8i1-masks.ll
207 ↗(On Diff #491218)

We have a new regression here :(

RKSimon added inline comments.Jan 23 2023, 2:03 AM
llvm/test/CodeGen/X86/v8i1-masks.ll
207 ↗(On Diff #491218)

Same regression - I just refactored the file recently to add AVX512 coverage

RKSimon added inline comments.Jan 23 2023, 9:28 AM
llvm/test/CodeGen/X86/any_extend_vector_inreg_of_broadcast_from_memory.ll
3654

We still need to look at this - vbroadcastf128 {{.*#+}} ymm1 = mem[0,1,0,1] and vmovdqa (%rdi), %xmm2 have been split but should be able to share the same load.

RKSimon added inline comments.Jan 23 2023, 9:29 AM
llvm/test/CodeGen/X86/insertelement-var-index.ll
2338

we're inserting the same load that we've already broadcast to the entire zmm?

deadalnix updated this revision to Diff 491668.Jan 24 2023, 1:46 AM

Rebase on top of the fix for v8i1-masks.ll

It might be interesting to make the previous order optional with a flag to llc for testing purposes, just to check which transforms are flaky/dependent on happenstance codes.

It might be interesting to make the previous order optional with a flag to llc for testing purposes, just to check which transforms are flaky/dependent on happenstance codes.

This would be useful for the initial stages when this is committed to trunk as I'm guessing we'll be fighting regressions for a while (which is why I reckon getting this committed shortly after the 16.0 cherry picks quieten down would be ideal) - hopefully it'd never get to a release branch though, or used in any test files.

deadalnix updated this revision to Diff 493290.Jan 30 2023, 5:41 AM

One more rebase and upgradign tests:

Matt added a subscriber: Matt.Feb 1 2023, 8:33 PM
deadalnix updated this revision to Diff 495668.Feb 7 2023, 3:49 PM

remove erroneous autogen note.

chfast added inline comments.Feb 8 2023, 12:19 AM
llvm/test/CodeGen/X86/avx512vl-vec-masked-cmp.ll
2694

Looks like fixed now.

deadalnix added inline comments.Feb 14 2023, 1:42 PM
llvm/test/CodeGen/X86/pr53419.ll
132 ↗(On Diff #497430)

This file has now regressed :'(

RKSimon added a subscriber: kazu.Mar 16 2023, 5:11 AM

Looks like we're very close to finally getting this in - @kazu @goldstein.w.n do you recognize any of the remaining regressions?

RKSimon commandeered this revision.Mar 17 2023, 4:39 AM
RKSimon edited reviewers, added: deadalnix; removed: RKSimon.

(temporarily commandeering to rebase patch) @deadalnix please take this back when you're about

RKSimon added inline comments.Mar 17 2023, 5:02 AM
llvm/test/CodeGen/X86/add-and-not.ll
305 ↗(On Diff #506040)
SelectionDAG has 14 nodes:
  t0: ch,glue = EntryToken
  t2: i64,ch = CopyFromReg t0, Register:i64 %0
            t15: i32 = truncate t2
          t16: i32 = xor t15, Constant:i32<-1>
        t19: i32 = and t16, Constant:i32<1>
      t20: i64 = zero_extend t19
    t6: i64 = add t2, t20
  t9: ch,glue = CopyToReg t0, Register:i64 $rax, t6
  t10: ch = X86ISD::RET_FLAG t9, TargetConstant:i32<0>, Register:i64 $rax, t9:1
llvm/test/CodeGen/X86/dagcombine-select.ll
217

Looks like the truncate means we're now failing to call foldBinOpIntoSelect before:

SelectionDAG has 15 nodes:
  t0: ch,glue = EntryToken
  t2: i32,ch = CopyFromReg t0, Register:i32 %0
  t3: i8 = truncate t2
          t4: i1 = truncate t2
        t7: i32 = select t4, Constant:i32<2>, Constant:i32<3>
      t9: i8 = truncate t7
    t10: i32 = shl Constant:i32<1>, t9
  t13: ch,glue = CopyToReg t0, Register:i32 $eax, t10
  t14: ch = X86ISD::RET_FLAG t13, TargetConstant:i32<0>, Register:i32 $eax, t13:1

becomes:

SelectionDAG has 14 nodes:
  t0: ch,glue = EntryToken
            t2: i32,ch = CopyFromReg t0, Register:i32 %0
          t23: i8 = truncate t2
        t25: i8 = and t23, Constant:i8<1>
      t22: i8 = xor t25, Constant:i8<3>
    t10: i32 = shl Constant:i32<1>, t22
  t13: ch,glue = CopyToReg t0, Register:i32 $eax, t10
  t14: ch = X86ISD::RET_FLAG t13, TargetConstant:i32<0>, Register:i32 $eax, t13:1
RKSimon updated this revision to Diff 506561.Mar 20 2023, 6:33 AM

address regressions in foldBinOpIntoSelect when handling shift(x, trunc/zext(y)) patterns

shift amount type canonicalization was preventing foldBinOpIntoSelect combines before other select-of-constant combines occurred

Unfortunately I haven't found a good way to test the fix separately from the main patch.

@RKSimon You did well to take this over.

llvm/test/CodeGen/X86/add-and-not.ll
305 ↗(On Diff #510916)

I was able to repro this one standalone in D147821 .

For some reason, this fix work on 32 bits, but there is still a missing piece for 64bits.

deadalnix added inline comments.Apr 7 2023, 4:53 PM
llvm/test/CodeGen/X86/add-and-not.ll
305 ↗(On Diff #510916)

D147827 sorts this out.

deadalnix added inline comments.Apr 8 2023, 7:40 AM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
2433–2439

Shouldn't this come in its own patch?

RKSimon added inline comments.Apr 10 2023, 9:23 AM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
2433–2439

Yes, I just hadn't found a good way to test with trunk so far.

deadalnix commandeered this revision.Apr 11 2023, 3:40 PM
deadalnix edited reviewers, added: RKSimon; removed: deadalnix.
deadalnix updated this revision to Diff 512679.Apr 12 2023, 1:00 AM

Regen LLVM.CodeGen/PowerPC::select_const.ll

RKSimon added inline comments.Apr 12 2023, 3:05 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
47324 ↗(On Diff #512679)

I've removed this assertion in rGb20c1ffe8f3e as part of PR60007

deadalnix added inline comments.Apr 12 2023, 4:15 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
47324 ↗(On Diff #512679)

Will rebase.

deadalnix updated this revision to Diff 516403.Apr 24 2023, 7:18 AM

Rebase, unfortunately, we have a big regression on RISCV/pr58511.ll

RKSimon added inline comments.May 7 2023, 12:59 PM
llvm/test/CodeGen/X86/insert-into-constant-vector.ll
28 ↗(On Diff #516403)

Fix RISCV's pr58511.ll regression

deadalnix added inline comments.May 18 2023, 12:23 PM
llvm/test/CodeGen/X86/vector-reduce-or-bool.ll
507–509

Do we care about these? They replace a load with a materialization via vpbroadcast, and it's not clear to me that is actually worse. It's not clear to me either when one option is picked over the other.

RKSimon added inline comments.May 18 2023, 1:36 PM
llvm/test/CodeGen/X86/vector-reduce-or-bool.ll
507–509

Don't worry about these - its an existing problem with build vector constants

deadalnix added inline comments.May 19 2023, 1:47 PM
llvm/test/CodeGen/X86/vector-reduce-or-bool.ll
508

Ok, what outstanding issue do we have left? It seems like this is close to good to go.

Yes, I think this is very close now - please can you get some test-suite numbers to highlight any perf differences?

Yes, I think this is very close now - please can you get some test-suite numbers to highlight any perf differences?

I ran the test suite with and without the patch. The perf difference is well within the measurement noise if there is one at all.

Thanks, we're definitely very close now - please can you update the summary to something closer to a commit message

deadalnix edited the summary of this revision. (Show Details)May 23 2023, 4:10 PM
deadalnix edited the summary of this revision. (Show Details)
foad added a comment.May 24 2023, 2:40 AM

FYI I ran this patch on an internal corpus of AMDGPU graphics shaders and didn't see anything alarming. There was a slight change in the way fmul+fadd are combined into fma, but I don't think it was consistently worse - just different. Anyway I will work on that, but I do not think it should block this patch.

FYI I ran this patch on an internal corpus of AMDGPU graphics shaders and didn't see anything alarming. There was a slight change in the way fmul+fadd are combined into fma, but I don't think it was consistently worse - just different. Anyway I will work on that, but I do not think it should block this patch.

The tests I ran also looked fine. Nothing looked worrying in what I tried.

RKSimon retitled this revision from [RFC][DAGCombine] Make sure combined nodes are added back to the worklist in topological order. to [DAGCombine] Make sure combined nodes are added back to the worklist in topological order..May 24 2023, 6:00 AM
deadalnix updated this revision to Diff 526435.May 29 2023, 6:10 AM

Rebase and fix tests

@deadalnix Please can you rebase again?

foad added a comment.Jun 1 2023, 7:24 AM

FYI I ran this patch on an internal corpus of AMDGPU graphics shaders and didn't see anything alarming. There was a slight change in the way fmul+fadd are combined into fma, but I don't think it was consistently worse - just different. Anyway I will work on that, but I do not think it should block this patch.

I added a test case @fma_vs_output_modifier in test/CodeGen/AMDGPU/dagcombine-fma-fmad.ll so if you rebase now you will see a new regression. D151890 fixes it.

RKSimon added a subscriber: nikic.Jun 1 2023, 7:37 AM

I think the next step for this patch is to decide how to get it committed, I'm expecting there will be a few perf regressions that the current tests miss (or only vaguely hint at), but IMO these are grossly outweighed by the benefits we're seeing.

But I'm worried it will end up being stuck in a reversion/re-commit loop for every report.

Does anyone else have any thoughts on this? @foad @nikic @craig.topper ?

foad added a comment.Jun 1 2023, 8:03 AM

Does anyone else have any thoughts on this? @foad @nikic @craig.topper ?

Not really :/ For my use cases (graphics shaders on AMDGPU) I am not too concerned, but to be safe we could run some fairly extensive performance tests. It would probably take about a week to get results from that.

nikic added a comment.Jun 1 2023, 8:15 AM

I think you can just go ahead and land this. At this point it doesn't seem like this is going to cause systematic regressions -- and for individual cases we can fix forward.

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
2489

For truncates, don't we need to make sure no significant bits of the shamt get truncated?

RKSimon added inline comments.Jun 1 2023, 8:24 AM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
2489

Yes, we should probably drop this change from the patch, accept the regression and address it in a followup

deadalnix added inline comments.Jun 1 2023, 9:52 AM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
2489

Fair enough.

deadalnix updated this revision to Diff 527539.Jun 1 2023, 11:27 AM

Rebase and remove the "Peek through any trunc/zext to shift amount type." change.

deadalnix added inline comments.Jun 1 2023, 11:29 AM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
2489

I extracted this into D151916

If people are not too worried, I think we shoudl land this, because it's time consuming to keep it up to date as there are a lot of conflicts when rebasing.

deadalnix updated this revision to Diff 527548.Jun 1 2023, 11:56 AM

Fix CodeGen/PowerPC/select_const.ll

RKSimon accepted this revision.Jun 4 2023, 11:06 AM

LGTM - I think this is ready to land now

This revision is now accepted and ready to land.Jun 4 2023, 11:06 AM
deadalnix added inline comments.Jun 5 2023, 4:12 AM
llvm/test/CodeGen/Hexagon/autohvx/mulh.ll
19

@kparzysz This has been landed, your move now :)

foad added a comment.Jun 5 2023, 4:14 AM

@deadalnix congratulations and good luck! :)

jplehr added a subscriber: jplehr.Jun 5 2023, 6:47 AM

It seems that this broke the AMDGPU OpenMP runtime buildbot (https://lab.llvm.org/buildbot/#/builders/193/builds/32362).
I reverted locally and the build works again.
The issue I'm seeing is that the build of the file kmp_affinity.cpp

I also see build timeouts on my builds

It seems that this broke the AMDGPU OpenMP runtime buildbot (https://lab.llvm.org/buildbot/#/builders/193/builds/32362).
I reverted locally and the build works again.
The issue I'm seeing is that the build of the file kmp_affinity.cpp

In a 2-stage build, D127115 caused Clang to end in an infinite loop when compiling some files to.
This applies to a few other files, including llvm/lib/IR/AsmWriter.cpp

RKSimon reopened this revision.Jun 5 2023, 2:28 PM

Investigate infinite loops in stage2 compilers

This revision is now accepted and ready to land.Jun 5 2023, 2:28 PM
RKSimon requested changes to this revision.Jun 5 2023, 2:28 PM
This revision now requires changes to proceed.Jun 5 2023, 2:28 PM

@deadalnix I've reproduced the stage2 infinite loop on AsmWriter.cpp - currently trying to get bugpoint to reduce it.

define void @_ZN12_GLOBAL__N_114AssemblyWriterC2ERN4llvm21formatted_raw_ostreamERNS1_11SlotTrackerEPKNS1_6ModuleEPNS1_24AssemblyAnnotationWriterEbb() {
entry:
  %__end1.sroa.5.0.copyload = load ptr, ptr poison, align 8
  %__end1.sroa.6.0.copyload = load ptr, ptr null, align 8
  %cmp.i.i.i8.i.i = icmp ne ptr null, %__end1.sroa.6.0.copyload
  %cmp.i.i.i.i9.i.i = icmp ne ptr null, %__end1.sroa.5.0.copyload
  %.not.i = select i1 %cmp.i.i.i8.i.i, i1 true, i1 %cmp.i.i.i.i9.i.i
  %.not.i.fr = freeze i1 %.not.i
  br i1 %.not.i.fr, label %for.cond.us, label %if.end.split

for.cond.us:                                      ; preds = %entry
  unreachable

if.end.split:                                     ; preds = %entry
  ret void
}
n-omer added a comment.Jun 6 2023, 7:34 AM
define void @_ZN12_GLOBAL__N_114AssemblyWriterC2ERN4llvm21formatted_raw_ostreamERNS1_11SlotTrackerEPKNS1_6ModuleEPNS1_24AssemblyAnnotationWriterEbb() {
entry:
  %__end1.sroa.5.0.copyload = load ptr, ptr poison, align 8
  %__end1.sroa.6.0.copyload = load ptr, ptr null, align 8
  %cmp.i.i.i8.i.i = icmp ne ptr null, %__end1.sroa.6.0.copyload
  %cmp.i.i.i.i9.i.i = icmp ne ptr null, %__end1.sroa.5.0.copyload
  %.not.i = select i1 %cmp.i.i.i8.i.i, i1 true, i1 %cmp.i.i.i.i9.i.i
  %.not.i.fr = freeze i1 %.not.i
  br i1 %.not.i.fr, label %for.cond.us, label %if.end.split

for.cond.us:                                      ; preds = %entry
  unreachable

if.end.split:                                     ; preds = %entry
  ret void
}

We've also identified this issue in an internal codebase built with this patch, the problem is that // X != Y --> (X^Y) in TargetLowering::SimplifySetCC and Transform (brcond (xor x, y)) -> (brcond (setcc, x, y, ne)) in DAGCombiner::rebuildSetCC keep undoing each other without an end:

Combining: t1517: ch = brcond t0, t1520, BasicBlock:ch<if.end.split 0x1ae28ac5070>
Creating new node: t1522: i1 = setcc t1521, Constant:i1<-1>, setne:ch
Creating new node: t1523: ch = brcond t0, t1522, BasicBlock:ch<if.end.split 0x1ae28ac5070>
 ... into: t1523: ch = brcond t0, t1522, BasicBlock:ch<if.end.split 0x1ae28ac5070>

Combining: t1521: i1 = freeze t9

Combining: t1523: ch = brcond t0, t1522, BasicBlock:ch<if.end.split 0x1ae28ac5070>

Combining: t1522: i1 = setcc t1521, Constant:i1<-1>, setne:ch
Creating new node: t1524: i1 = setcc t9, Constant:i1<-1>, setne:ch
Creating new node: t1525: i1 = freeze t1524
 ... into: t1525: i1 = freeze t1524

Combining: t1525: i1 = freeze t1524

Combining: t1524: i1 = setcc t9, Constant:i1<-1>, setne:ch
Creating new node: t1526: i1 = xor t9, Constant:i1<-1>
 ... into: t1526: i1 = xor t9, Constant:i1<-1>

Combining: t1526: i1 = xor t9, Constant:i1<-1>

Combining: t1525: i1 = freeze t1526
Creating new node: t1527: i1 = freeze t9
 ... into: t1526: i1 = xor t1527, Constant:i1<-1>

Combining: t1526: i1 = xor t1527, Constant:i1<-1>

Combining: t1527: i1 = freeze t9

Combining: t1523: ch = brcond t0, t1526, BasicBlock:ch<if.end.split 0x1ae28ac5070> --------> Back to where we started.
Creating new node: t1528: i1 = setcc t1527, Constant:i1<-1>, setne:ch
Creating new node: t1529: ch = brcond t0, t1528, BasicBlock:ch<if.end.split 0x1ae28ac5070>
 ... into: t1529: ch = brcond t0, t1528, BasicBlock:ch<if.end.split 0x1ae28ac5070>

We've also identified this issue in an internal codebase built with this patch, the problem is that // X != Y --> (X^Y) in TargetLowering::SimplifySetCC and Transform (brcond (xor x, y)) -> (brcond (setcc, x, y, ne)) in DAGCombiner::rebuildSetCC keep undoing each other without an end:

That seems relatively easy to fix, the question is, which one do we want?

define void @_ZN12_GLOBAL__N_114AssemblyWriterC2ERN4llvm21formatted_raw_ostreamERNS1_11SlotTrackerEPKNS1_6ModuleEPNS1_24AssemblyAnnotationWriterEbb() {
entry:
  %__end1.sroa.5.0.copyload = load ptr, ptr poison, align 8
  %__end1.sroa.6.0.copyload = load ptr, ptr null, align 8
  %cmp.i.i.i8.i.i = icmp ne ptr null, %__end1.sroa.6.0.copyload
  %cmp.i.i.i.i9.i.i = icmp ne ptr null, %__end1.sroa.5.0.copyload
  %.not.i = select i1 %cmp.i.i.i8.i.i, i1 true, i1 %cmp.i.i.i.i9.i.i
  %.not.i.fr = freeze i1 %.not.i
  br i1 %.not.i.fr, label %for.cond.us, label %if.end.split

for.cond.us:                                      ; preds = %entry
  unreachable

if.end.split:                                     ; preds = %entry
  ret void
}

I am failing to see how this generates an infinite loop. llc compiles it just fine. Would you have more details on how to repro?

n-omer added a comment.Jun 7 2023, 7:21 AM
define void @_ZN12_GLOBAL__N_114AssemblyWriterC2ERN4llvm21formatted_raw_ostreamERNS1_11SlotTrackerEPKNS1_6ModuleEPNS1_24AssemblyAnnotationWriterEbb() {
entry:
  %__end1.sroa.5.0.copyload = load ptr, ptr poison, align 8
  %__end1.sroa.6.0.copyload = load ptr, ptr null, align 8
  %cmp.i.i.i8.i.i = icmp ne ptr null, %__end1.sroa.6.0.copyload
  %cmp.i.i.i.i9.i.i = icmp ne ptr null, %__end1.sroa.5.0.copyload
  %.not.i = select i1 %cmp.i.i.i8.i.i, i1 true, i1 %cmp.i.i.i.i9.i.i
  %.not.i.fr = freeze i1 %.not.i
  br i1 %.not.i.fr, label %for.cond.us, label %if.end.split

for.cond.us:                                      ; preds = %entry
  unreachable

if.end.split:                                     ; preds = %entry
  ret void
}

I am failing to see how this generates an infinite loop. llc compiles it just fine. Would you have more details on how to repro?

With this patch applied on top of main@2011ad0cbbf52a6f3b7bf76aa40578d3ff9fd60d I'm able to reproduce the infinite loop with llc.

I have this one, which repro and is simpler:

define i64 @foo(i1 %0) {
  br label %2

2:
  %3 = select i1 %0, i1 %0, i1 false
  %4 = freeze i1 %3
  br i1 %4, label %5, label %6

5:
  br label %6

6:
  %7 = phi i64 [ 0, %5 ], [ 1, %2 ]
  ret i64 %7
}

I'll figure out a fix. That doesn't sound complicated.

I can confirm D152430 indeed fixes the infinite loop.

deadalnix updated this revision to Diff 529937.Jun 9 2023, 6:07 AM

Rebase on top of D152430

Are you now able to create a stage2 build?

@deadalnix Please can you pre-commit the brcond regression test(s) and rebase on D152544 / 5c6ff3a6025570479da5b72fcd02ca93b470683b

@deadalnix Please can you pre-commit the brcond regression test(s) and rebase on D152544 / 5c6ff3a6025570479da5b72fcd02ca93b470683b

The regression test has been precommitted already in aa5a1eaa38bbcff64e22a0e0662843d119d3d71f

I'm rebasing this one now.

deadalnix updated this revision to Diff 530482.Jun 12 2023, 6:05 AM

Rebase on top of D152544

As far as I know, the infinite loop problem is solved now, do we have any outstanding issue? Shall we give this another try?

RKSimon accepted this revision.Jun 12 2023, 6:30 AM

LGTM

This revision is now accepted and ready to land.Jun 12 2023, 6:30 AM

Thanks for working on this and good luck.

I've been fuzzing this change since one day before it landed (for the second time) hoping to find some other compilation hangs. So far so good.

kparzysz added inline comments.Jun 16 2023, 1:01 PM
llvm/test/CodeGen/Hexagon/autohvx/mulh.ll
19

Excellent. Working on it.