This is an archive of the discontinued LLVM Phabricator instance.

[DAGCombiner] Fix non-determinism problem related to argument evaluation order in visitFDIV
ClosedPublic

Authored by bjope on Mar 17 2020, 1:30 PM.

Details

Summary

For some reason the order in which we call getNegatedExpression
for the involved operands, after a call to isCheaperToUseNegatedFPOps,
seem to matter. This patch includes a new test case in
test/CodeGen/X86/fdiv.ll that crashes if we reverse the order of
those calls. Before this patch that could happen depending on
which compiler that were used when buildind llvm. With my GCC
version (7.4.0) I got the crash, because it seems like it is
using a different order for the argument evaluation compared
to clang.

All other users of isCheaperToUseNegatedFPOps already used this
pattern with unfolded/ordered calls to getNegatedExpression, so
this patch is aligning visitFDIV with the other use cases.

Diff Detail

Event Timeline

bjope created this revision.Mar 17 2020, 1:30 PM
Herald added a project: Restricted Project. · View Herald TranscriptMar 17 2020, 1:30 PM
bjope updated this revision to Diff 250926.Mar 17 2020, 3:19 PM

Fix formatting.

spatel added inline comments.Mar 19 2020, 6:19 AM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
13063–13066

If you reverse the order of these statements, does the gcc-built-LLVM still crash? Ie, is this a reliable/reproducible failure and fix?

spatel added inline comments.Mar 19 2020, 6:23 AM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
13063–13066

Oops - just re-read the description text, and the answer is "yes".
Can you post the exact assert/unreachable that you are seeing?

bjope added inline comments.Mar 19 2020, 6:49 AM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
13063–13066

I didn't have the exact stack trace from the x86 reproducer at hand, but the below should give the picture.

And as you may have discovered I get the crash both when building llc using clang and gcc, as soon as I reverse the order here. So it should hopefully be easy to reproduce it (unless there is something even nastier involved).

Unknown code
UNREACHABLE executed at ../lib/CodeGen/SelectionDAG/TargetLowering.cpp:5812!
Stack dump:
 #0 0x00000000027c7de5 llvm::sys::PrintStackTrace(llvm::raw_ostream&) (/.../compiler-clang/bin/llc+0x27c7de5)
 #1 0x00000000027c5a94 llvm::sys::RunSignalHandlers() (/.../compiler-clang/bin/llc+0x27c5a94)
 #2 0x00000000027c5bd3 SignalHandler(int) (/.../compiler-clang/bin/llc+0x27c5bd3)
 #3 0x00007fa8af489630 __restore_rt (/lib64/libpthread.so.0+0xf630)
 #4 0x00007fa8ae224377 raise (/lib64/libc.so.6+0x36377)
 #5 0x00007fa8ae225a68 abort (/lib64/libc.so.6+0x37a68)
 #6 0x00000000027463fa (/.../compiler-clang/bin/llc+0x27463fa)
 #7 0x000000000264fbcc llvm::TargetLowering::getNegatedExpression(llvm::SDValue, llvm::SelectionDAG&, bool, bool, unsigned int) const (/.../compiler-clang/bin/llc+0x264fbcc)
 #8 0x000000000264f620 llvm::TargetLowering::getNegatedExpression(llvm::SDValue, llvm::SelectionDAG&, bool, bool, unsigned int) const (/.../compiler-clang/bin/llc+0x264f620)
 #9 0x00000000025215c3 (anonymous namespace)::DAGCombiner::visitFDIV(llvm::SDNode*) (/.../compiler-clang/bin/llc+0x25215c3)
#10 0x0000000002523ed1 (anonymous namespace)::DAGCombiner::visit(llvm::SDNode*) (/.../compiler-clang/bin/llc+0x2523ed1)
#11 0x000000000252525c (anonymous namespace)::DAGCombiner::combine(llvm::SDNode*) (/.../compiler-clang/bin/llc+0x252525c)
#12 0x0000000002526d8d (anonymous namespace)::DAGCombiner::Run(llvm::CombineLevel) (/.../compiler-clang/bin/llc+0x2526d8d)
#13 0x0000000002529000 llvm::SelectionDAG::Combine(llvm::CombineLevel, llvm::AAResults*, llvm::CodeGenOpt::Level) (/.../compiler-clang/bin/llc+0x2529000)
#14 0x0000000002626330 llvm::SelectionDAGISel::CodeGenAndEmitDAG() (/.../compiler-clang/bin/llc+0x2626330)
#15 0x000000000262ad58 llvm::SelectionDAGISel::SelectAllBasicBlocks(llvm::Function const&) (/.../compiler-clang/bin/llc+0x262ad58)
#16 0x000000000262c709 llvm::SelectionDAGISel::runOnMachineFunction(llvm::MachineFunction&) (.part.868) (/.../compiler-clang/bin/llc+0x262c709)
#17 0x000000000096e709 (anonymous namespace)::MyDAGToDAGISel::runOnMachineFunction(llvm::MachineFunction&) (/.../compiler-clang/bin/llc+0x96e709)
#18 0x0000000001d1b01f llvm::MachineFunctionPass::runOnFunction(llvm::Function&) (/.../compiler-clang/bin/llc+0x1d1b01f)
#19 0x00000000020e564e llvm::FPPassManager::runOnFunction(llvm::Function&) (/.../compiler-clang/bin/llc+0x20e564e)
#20 0x00000000020e60f9 llvm::FPPassManager::runOnModule(llvm::Module&) (/.../compiler-clang/bin/llc+0x20e60f9)
#21 0x00000000020e64c1 llvm::legacy::PassManagerImpl::run(llvm::Module&) (/.../compiler-clang/bin/llc+0x20e64c1)
#22 0x00000000006bb219 main (/.../compiler-clang/bin/llc+0x6bb219)
bjope marked an inline comment as done.Mar 19 2020, 6:59 AM
bjope added inline comments.
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
13063–13066

So it is the "Unknown code" llvm_unreachable at the end of TargetLowering::getNegatedExpression that we hit.

bjope marked an inline comment as not done.Mar 19 2020, 6:59 AM
spatel added inline comments.Mar 19 2020, 8:36 AM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
13063–13066

Thanks, so this bug exists independently of this change. Ie, I can get the crash when compiling LLVM with LLVM by commuting the fdiv operands in your test. Similarly, the bug should re-appear in the gcc-compiled binary after this fix by commuting the operands.

When we getNegatedExpression() on %sub4 in the test, it creates an extra use of %sub1. And that alters the subsequent getNegatedExpression() of %mul2.

So this patch is ok if you still want to make it, but we need to adjust the code and test comments. The underlying bug is still lurking.

spatel added inline comments.Mar 19 2020, 11:41 AM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
13063–13066

Proposal for a better (but still incomplete) fix:
D76439

But I think we should go ahead with this patch too to reduce non-determinism.

bjope marked an inline comment as done.Mar 19 2020, 2:22 PM
bjope added inline comments.
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
13063–13066

One idea I had was to make any current relationships between isCheaperToUseNegatedFPOps/getNegatedExpression/getNegatibleCost by letting isCheaperToUseNegatedFPOps return Neg0/Neg1.
But I suspected that the getNegatedExpression vs getNegatibleCost was something that could be more general (not only a problem when using isCheaperToUseNegatedFPOps), and returning a bool as well as a SDValue pair makes the use of isCheaperToUseNegatedFPOps a bit messier. So thanks for taking the closer look at this.

So what to do with this patch? I guess I can update the code comment (and commit msg) in this patch to mention the non-determinism as reason (and the getNegatedExpression/getNegatibleCost phase ordering bug as a side-note about how it was detected).

spatel accepted this revision.Mar 20 2020, 4:10 AM
spatel added inline comments.
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
13063–13066

Yes, please update (or just remove?) the comments and update the commit msg. The code change itself and test LGTM. This patch will at least make the failure mode consistent (and we know how to reliably reproduce that).

This revision is now accepted and ready to land.Mar 20 2020, 4:10 AM
bjope marked an inline comment as done.Mar 20 2020, 8:21 AM
bjope added inline comments.
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
13063–13066

Ok, thanks!
I've let the commit msg explain the diff here. And just kept some motivation for the added test case.

This revision was automatically updated to reflect the committed changes.

For some reason the order in which we call getNegatedExpression for the involved operands, after a call to isCheaperToUseNegatedFPOps, seem to matter.

Committing a change without understanding why the order of execution of getNegatedExpression matters seems a bit problematic - could you look into/explain why the order of two calls to that function that sounds pretty benign/order independent?

bjope added a comment.Mar 25 2020, 4:23 PM

For some reason the order in which we call getNegatedExpression for the involved operands, after a call to isCheaperToUseNegatedFPOps, seem to matter.

Committing a change without understanding why the order of execution of getNegatedExpression matters seems a bit problematic - could you look into/explain why the order of two calls to that function that sounds pretty benign/order independent?

Forgot to update the description here, but in the final commit msg (see https://reviews.llvm.org/rGd168b7778035af6cc795b2367ca7f379ce1a629e) there is a reference to https://reviews.llvm.org/D76439 which follows up on some of the problems with getNegatedExpression() and getNegatibleCost() depending on each other. That seems to have been a problem for some time. I don't know the details myself, but I guess one problem is that those methods call each other. And since at least getNegatedExpression() may rewrite the expression tree, and for example the result from getNegatibleCost may vary from time to time if doing a call to getNegatedExpression() in between (so the checks done by isCheaperToUseNegatedFPOps is not neccessarily valid after calling getNegatedExpression the first time).
As far as I understand one could see this as two problems:

  1. We want to get deterministic results from the compiler, and since getNegatedExpression() may rewrite the expression tree (well, at least add nodes and replace some uses) one might get different results depending on the order of calls to it (I think that was due to taking decisions based on "number of uses"). So my patch simply made sure the order is the same (not depending on argument evaluation order).
  2. Getting rid of weird dependencies between getNegatedExpression and getNegatedExpression (and in some sense also isCheaperToUseNegatedFPOps) to make this part of the code less error-prone.