Currently, a node and its users are added back to the worklist in reverse topological order after it is combined. This diff changes that order to be topological. This is part of a larger migration to get the DAGCombiner to process nodes in topological order.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
llvm/test/CodeGen/X86/v8i1-masks.ll | ||
---|---|---|
207 ↗ | (On Diff #491218) | We have a new regression here :( |
llvm/test/CodeGen/X86/v8i1-masks.ll | ||
---|---|---|
207 ↗ | (On Diff #491218) | Same regression - I just refactored the file recently to add AVX512 coverage |
llvm/test/CodeGen/X86/any_extend_vector_inreg_of_broadcast_from_memory.ll | ||
---|---|---|
3654 | We still need to look at this - vbroadcastf128 {{.*#+}} ymm1 = mem[0,1,0,1] and vmovdqa (%rdi), %xmm2 have been split but should be able to share the same load. |
llvm/test/CodeGen/X86/insertelement-var-index.ll | ||
---|---|---|
2338 | we're inserting the same load that we've already broadcast to the entire zmm? |
It might be interesting to make the previous order optional with a flag to llc for testing purposes, just to check which transforms are flaky/dependent on happenstance codes.
This would be useful for the initial stages when this is committed to trunk as I'm guessing we'll be fighting regressions for a while (which is why I reckon getting this committed shortly after the 16.0 cherry picks quieten down would be ideal) - hopefully it'd never get to a release branch though, or used in any test files.
One more rebase and upgradign tests:
- Several tests where converted to opaque pointer type.
- rG37bc62ed0a24303aa572155009358b8937ab8b4c
llvm/test/CodeGen/X86/avx512vl-vec-masked-cmp.ll | ||
---|---|---|
2694 | Looks like fixed now. |
llvm/test/CodeGen/X86/pr53419.ll | ||
---|---|---|
132 ↗ | (On Diff #497430) | This file has now regressed :'( |
Looks like we're very close to finally getting this in - @kazu @goldstein.w.n do you recognize any of the remaining regressions?
(temporarily commandeering to rebase patch) @deadalnix please take this back when you're about
llvm/test/CodeGen/X86/add-and-not.ll | ||
---|---|---|
305 ↗ | (On Diff #506040) | SelectionDAG has 14 nodes: t0: ch,glue = EntryToken t2: i64,ch = CopyFromReg t0, Register:i64 %0 t15: i32 = truncate t2 t16: i32 = xor t15, Constant:i32<-1> t19: i32 = and t16, Constant:i32<1> t20: i64 = zero_extend t19 t6: i64 = add t2, t20 t9: ch,glue = CopyToReg t0, Register:i64 $rax, t6 t10: ch = X86ISD::RET_FLAG t9, TargetConstant:i32<0>, Register:i64 $rax, t9:1 |
llvm/test/CodeGen/X86/dagcombine-select.ll | ||
217 | Looks like the truncate means we're now failing to call foldBinOpIntoSelect before: SelectionDAG has 15 nodes: t0: ch,glue = EntryToken t2: i32,ch = CopyFromReg t0, Register:i32 %0 t3: i8 = truncate t2 t4: i1 = truncate t2 t7: i32 = select t4, Constant:i32<2>, Constant:i32<3> t9: i8 = truncate t7 t10: i32 = shl Constant:i32<1>, t9 t13: ch,glue = CopyToReg t0, Register:i32 $eax, t10 t14: ch = X86ISD::RET_FLAG t13, TargetConstant:i32<0>, Register:i32 $eax, t13:1 becomes: SelectionDAG has 14 nodes: t0: ch,glue = EntryToken t2: i32,ch = CopyFromReg t0, Register:i32 %0 t23: i8 = truncate t2 t25: i8 = and t23, Constant:i8<1> t22: i8 = xor t25, Constant:i8<3> t10: i32 = shl Constant:i32<1>, t22 t13: ch,glue = CopyToReg t0, Register:i32 $eax, t10 t14: ch = X86ISD::RET_FLAG t13, TargetConstant:i32<0>, Register:i32 $eax, t13:1 |
address regressions in foldBinOpIntoSelect when handling shift(x, trunc/zext(y)) patterns
shift amount type canonicalization was preventing foldBinOpIntoSelect combines before other select-of-constant combines occurred
Unfortunately I haven't found a good way to test the fix separately from the main patch.
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp | ||
---|---|---|
2433–2439 | Shouldn't this come in its own patch? |
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp | ||
---|---|---|
2433–2439 | Yes, I just hadn't found a good way to test with trunk so far. |
llvm/lib/Target/X86/X86ISelLowering.cpp | ||
---|---|---|
47324 ↗ | (On Diff #512679) | I've removed this assertion in rGb20c1ffe8f3e as part of PR60007 |
llvm/lib/Target/X86/X86ISelLowering.cpp | ||
---|---|---|
47324 ↗ | (On Diff #512679) | Will rebase. |
llvm/test/CodeGen/X86/insert-into-constant-vector.ll | ||
---|---|---|
28 ↗ | (On Diff #516403) | These should be fixed by rG17dd1ad14be77c722f7c7c1e4fa273c6f170abea |
llvm/test/CodeGen/X86/vector-reduce-or-bool.ll | ||
---|---|---|
507–509 | Do we care about these? They replace a load with a materialization via vpbroadcast, and it's not clear to me that is actually worse. It's not clear to me either when one option is picked over the other. |
llvm/test/CodeGen/X86/vector-reduce-or-bool.ll | ||
---|---|---|
507–509 | Don't worry about these - its an existing problem with build vector constants |
llvm/test/CodeGen/X86/vector-reduce-or-bool.ll | ||
---|---|---|
508 | Ok, what outstanding issue do we have left? It seems like this is close to good to go. |
Yes, I think this is very close now - please can you get some test-suite numbers to highlight any perf differences?
I ran the test suite with and without the patch. The perf difference is well within the measurement noise if there is one at all.
Thanks, we're definitely very close now - please can you update the summary to something closer to a commit message
FYI I ran this patch on an internal corpus of AMDGPU graphics shaders and didn't see anything alarming. There was a slight change in the way fmul+fadd are combined into fma, but I don't think it was consistently worse - just different. Anyway I will work on that, but I do not think it should block this patch.
FYI I ran this patch on an internal corpus of AMDGPU graphics shaders and didn't see anything alarming. There was a slight change in the way fmul+fadd are combined into fma, but I don't think it was consistently worse - just different. Anyway I will work on that, but I do not think it should block this patch.
I added a test case @fma_vs_output_modifier in test/CodeGen/AMDGPU/dagcombine-fma-fmad.ll so if you rebase now you will see a new regression. D151890 fixes it.
I think the next step for this patch is to decide how to get it committed, I'm expecting there will be a few perf regressions that the current tests miss (or only vaguely hint at), but IMO these are grossly outweighed by the benefits we're seeing.
But I'm worried it will end up being stuck in a reversion/re-commit loop for every report.
Does anyone else have any thoughts on this? @foad @nikic @craig.topper ?
Does anyone else have any thoughts on this? @foad @nikic @craig.topper ?
Not really :/ For my use cases (graphics shaders on AMDGPU) I am not too concerned, but to be safe we could run some fairly extensive performance tests. It would probably take about a week to get results from that.
I think you can just go ahead and land this. At this point it doesn't seem like this is going to cause systematic regressions -- and for individual cases we can fix forward.
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp | ||
---|---|---|
2489 | For truncates, don't we need to make sure no significant bits of the shamt get truncated? |
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp | ||
---|---|---|
2489 | Yes, we should probably drop this change from the patch, accept the regression and address it in a followup |
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp | ||
---|---|---|
2489 | Fair enough. |
If people are not too worried, I think we shoudl land this, because it's time consuming to keep it up to date as there are a lot of conflicts when rebasing.
It seems that this broke the AMDGPU OpenMP runtime buildbot (https://lab.llvm.org/buildbot/#/builders/193/builds/32362).
I reverted locally and the build works again.
The issue I'm seeing is that the build of the file kmp_affinity.cpp
In a 2-stage build, D127115 caused Clang to end in an infinite loop when compiling some files to.
This applies to a few other files, including llvm/lib/IR/AsmWriter.cpp
@deadalnix I've reproduced the stage2 infinite loop on AsmWriter.cpp - currently trying to get bugpoint to reduce it.
define void @_ZN12_GLOBAL__N_114AssemblyWriterC2ERN4llvm21formatted_raw_ostreamERNS1_11SlotTrackerEPKNS1_6ModuleEPNS1_24AssemblyAnnotationWriterEbb() { entry: %__end1.sroa.5.0.copyload = load ptr, ptr poison, align 8 %__end1.sroa.6.0.copyload = load ptr, ptr null, align 8 %cmp.i.i.i8.i.i = icmp ne ptr null, %__end1.sroa.6.0.copyload %cmp.i.i.i.i9.i.i = icmp ne ptr null, %__end1.sroa.5.0.copyload %.not.i = select i1 %cmp.i.i.i8.i.i, i1 true, i1 %cmp.i.i.i.i9.i.i %.not.i.fr = freeze i1 %.not.i br i1 %.not.i.fr, label %for.cond.us, label %if.end.split for.cond.us: ; preds = %entry unreachable if.end.split: ; preds = %entry ret void }
We've also identified this issue in an internal codebase built with this patch, the problem is that // X != Y --> (X^Y) in TargetLowering::SimplifySetCC and Transform (brcond (xor x, y)) -> (brcond (setcc, x, y, ne)) in DAGCombiner::rebuildSetCC keep undoing each other without an end:
Combining: t1517: ch = brcond t0, t1520, BasicBlock:ch<if.end.split 0x1ae28ac5070> Creating new node: t1522: i1 = setcc t1521, Constant:i1<-1>, setne:ch Creating new node: t1523: ch = brcond t0, t1522, BasicBlock:ch<if.end.split 0x1ae28ac5070> ... into: t1523: ch = brcond t0, t1522, BasicBlock:ch<if.end.split 0x1ae28ac5070> Combining: t1521: i1 = freeze t9 Combining: t1523: ch = brcond t0, t1522, BasicBlock:ch<if.end.split 0x1ae28ac5070> Combining: t1522: i1 = setcc t1521, Constant:i1<-1>, setne:ch Creating new node: t1524: i1 = setcc t9, Constant:i1<-1>, setne:ch Creating new node: t1525: i1 = freeze t1524 ... into: t1525: i1 = freeze t1524 Combining: t1525: i1 = freeze t1524 Combining: t1524: i1 = setcc t9, Constant:i1<-1>, setne:ch Creating new node: t1526: i1 = xor t9, Constant:i1<-1> ... into: t1526: i1 = xor t9, Constant:i1<-1> Combining: t1526: i1 = xor t9, Constant:i1<-1> Combining: t1525: i1 = freeze t1526 Creating new node: t1527: i1 = freeze t9 ... into: t1526: i1 = xor t1527, Constant:i1<-1> Combining: t1526: i1 = xor t1527, Constant:i1<-1> Combining: t1527: i1 = freeze t9 Combining: t1523: ch = brcond t0, t1526, BasicBlock:ch<if.end.split 0x1ae28ac5070> --------> Back to where we started. Creating new node: t1528: i1 = setcc t1527, Constant:i1<-1>, setne:ch Creating new node: t1529: ch = brcond t0, t1528, BasicBlock:ch<if.end.split 0x1ae28ac5070> ... into: t1529: ch = brcond t0, t1528, BasicBlock:ch<if.end.split 0x1ae28ac5070>
I am failing to see how this generates an infinite loop. llc compiles it just fine. Would you have more details on how to repro?
With this patch applied on top of main@2011ad0cbbf52a6f3b7bf76aa40578d3ff9fd60d I'm able to reproduce the infinite loop with llc.
I have this one, which repro and is simpler:
define i64 @foo(i1 %0) { br label %2 2: %3 = select i1 %0, i1 %0, i1 false %4 = freeze i1 %3 br i1 %4, label %5, label %6 5: br label %6 6: %7 = phi i64 [ 0, %5 ], [ 1, %2 ] ret i64 %7 }
I'll figure out a fix. That doesn't sound complicated.
@deadalnix Please can you pre-commit the brcond regression test(s) and rebase on D152544 / 5c6ff3a6025570479da5b72fcd02ca93b470683b
The regression test has been precommitted already in aa5a1eaa38bbcff64e22a0e0662843d119d3d71f
I'm rebasing this one now.
As far as I know, the infinite loop problem is solved now, do we have any outstanding issue? Shall we give this another try?
Thanks for working on this and good luck.
I've been fuzzing this change since one day before it landed (for the second time) hoping to find some other compilation hangs. So far so good.
llvm/test/CodeGen/Hexagon/autohvx/mulh.ll | ||
---|---|---|
19 | Excellent. Working on it. |
Shouldn't this come in its own patch?