This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] Don't sink the fdiv from (fmul (fdiv 1.0, %x), %y) if the fdiv isn't in the same basic block as the fmul
Needs ReviewPublic

Authored by craig.topper on Sep 10 2020, 12:48 PM.

Details

Summary

Loop invariant code motion can pull an fdiv out of by putting a reciprocal outside and using an fmul inside. InstCombine shouldn't reverse that.

I just did a simple basic block parent check rather than checking for the loop. I can call getLoop and isInvariant if we think that would be better.

Diff Detail

Event Timeline

craig.topper created this revision.Sep 10 2020, 12:48 PM
Herald added a project: Restricted Project. · View Herald TranscriptSep 10 2020, 12:48 PM
Herald added a subscriber: hiraditya. · View Herald Transcript
craig.topper requested review of this revision.Sep 10 2020, 12:48 PM

Add a comment describing the exception

When does that happen?
-O3 seems to be fine: https://godbolt.org/z/1K3r56
Do we have InstCombine invocations after last LICM?

When does that happen?
-O3 seems to be fine: https://godbolt.org/z/1K3r56
Do we have InstCombine invocations after last LICM?

We found it in our internal code base so maybe there’s something different about our pipeline. But still seems like we shouldn’t have two passes making opposite choices.

When does that happen?
-O3 seems to be fine: https://godbolt.org/z/1K3r56
Do we have InstCombine invocations after last LICM?

We found it in our internal code base so maybe there’s something different about our pipeline.

But still seems like we shouldn’t have two passes making opposite choices.

I think this isn't an exception, there's a lot of opposite transforms.

When does that happen?
-O3 seems to be fine: https://godbolt.org/z/1K3r56
Do we have InstCombine invocations after last LICM?

We found it in our internal code base so maybe there’s something different about our pipeline.

But still seems like we shouldn’t have two passes making opposite choices.

I think this isn't an exception, there's a lot of opposite transforms.

I don't know if it helps, but the last LICM that fixes it is disabled with -fno-unroll-loops https://godbolt.org/z/dxjMxb Though for this simple case MachineLICM recovers it. But MachineLICM is much weaker than IR LICM on nested loops.

This sounds very much like what happened here:
https://llvm.org/PR46115
https://github.com/JuliaLang/julia/issues/36084

So either we need a much bigger change (instcombine should not do code sinking), or we try to enforce not running instcombine after LICM somehow?

We don't seem to run LICM after InstCombine in the LTO pipeline if I'm reading the pass manager correctly.

Fix copy paste mistake. Update phase ordering test that is now affected.

We don't seem to run LICM after InstCombine in the LTO pipeline if I'm reading the pass manager correctly.

That seems like the root bug...although a 1st hack at changing that didn't result in any regression test fails in llvm/test/Other/* .
If we are inverting this fdiv pattern, then aren't we doing that to all kinds of other loop invariant opportunities too?

We don't seem to run LICM after InstCombine in the LTO pipeline if I'm reading the pass manager correctly.

That seems like the root bug...although a 1st hack at changing that didn't result in any regression test fails in llvm/test/Other/* .
If we are inverting this fdiv pattern, then aren't we doing that to all kinds of other loop invariant opportunities too?

Actually, we do have some coverage for the new pass manager at least. I got that to wiggle with:

diff --git a/llvm/lib/Passes/PassBuilder.cpp b/llvm/lib/Passes/PassBuilder.cpp
index bae84784628..acc937cb202 100644
--- a/llvm/lib/Passes/PassBuilder.cpp
+++ b/llvm/lib/Passes/PassBuilder.cpp
@@ -1576,6 +1576,9 @@ PassBuilder::buildLTODefaultPipeline(OptimizationLevel Level, bool DebugLogging,
   MainFPM.addPass(InstCombinePass());
   invokePeepholeEPCallbacks(MainFPM, Level);
   MainFPM.addPass(JumpThreadingPass(/*InsertFreezeWhenUnfoldingSelect*/ true));
+  MainFPM.addPass(createFunctionToLoopPassAdaptor(
+      LICMPass(PTO.LicmMssaOptCap, PTO.LicmMssaNoAccForPromotionCap),
+      EnableMSSALoopDependency, DebugLogging));
   MPM.addPass(createModuleToFunctionPassAdaptor(std::move(MainFPM)));
 
   // Create a function that performs CFI checks for cross-DSO calls with

We don't seem to run LICM after InstCombine in the LTO pipeline if I'm reading the pass manager correctly.

That seems like the root bug...although a 1st hack at changing that didn't result in any regression test fails in llvm/test/Other/* .
If we are inverting this fdiv pattern, then aren't we doing that to all kinds of other loop invariant opportunities too?

We likely are messing up other opportunities, but maybe their cost isn't as high as division so no one has noticed.

The case we spotted this on wasn't with LTO. So maybe we have another problem in our internal pipeline like Julia had.

It also seems to affect vectorization as the PhaseOrder test is showing. And we don't get it right with -fno-unroll-loops. Are those worth fixing?

lebedev.ri added a comment.EditedSep 11 2020, 10:54 AM

I personally really think this should be viewed as phase ordering issue, that should be addressed by adding/moving passes in pipeline.

I personally really think this should be viewed as phase ordering issue, that should be addressed by adding/moving pases in pipeline.

I agree. We have 3 likely problems already identified in this discussion: instcombine vs. LICM in LTO, instcombine vs. (no-)unroll, instcombine vs. loop-vectorizer.
The common opponent is instcombine of course because it does sinking transforms without cost models.

Maybe PR4268 is similar issue..

Maybe PR4268 is similar issue..

Wrong bug number?

lebedev.ri requested changes to this revision.Oct 8 2020, 3:45 AM

Just marking as reviewed since the review appears to be stalling.

This revision now requires changes to proceed.Oct 8 2020, 3:45 AM

@craig.topper Are you still looking at this?

@craig.topper Are you still looking at this?

No, not since I switched employers.

@craig.topper Are you still looking at this?

No, not since I switched employers.

Would it be OK if myself or @spatel commandeered this?

@craig.topper Are you still looking at this?

No, not since I switched employers.

Would it be OK if myself or @spatel commandeered this?

Absolutely.

The -Ofast -fno-unroll-loops example still shows up in the 12.0 release, but is fixed in trunk:
https://godbolt.org/z/4vss8qrTG

I didn't track down the commit that changed it, but I have added a regression test, so it will stick:
31e997fda1c6

The x86 asm also shows an interesting trade-off: SDAG could form a reciprocal estimate when the fdiv was sunk back into the loop...but it couldn't hoist the refinement math, so that's likely worse perf.

lebedev.ri resigned from this revision.Jan 12 2023, 4:50 PM

This review seems to be stuck/dead, consider abandoning if no longer relevant.

This revision now requires review to proceed.Jan 12 2023, 4:50 PM
Herald added a project: Restricted Project. · View Herald TranscriptJan 12 2023, 4:50 PM
Herald added a subscriber: StephenFan. · View Herald Transcript
craig.topper added inline comments.Feb 10 2023, 12:59 PM
llvm/lib/Transforms/InstCombine/InstCombineMulDivRem.cpp
531

I found another variation of this

(X/Y) outside loop
Mul with Z inside loop.

->

X*Z ends up inside loop
/ Y ends up in loop

->

recip of Y outside loop

X*Z in loop
mul with reciprocal of Y inside loop.

So we went from an fdiv outside loop and one fmul inside the loop to 1 div outside the loop and 2 fmuls inside the loop.

spatel added inline comments.Feb 13 2023, 8:28 AM
llvm/lib/Transforms/InstCombine/InstCombineMulDivRem.cpp
531

Does this model it:
https://godbolt.org/z/W4Ta9jzEn ?
...although MachineLICM is managing to pull the fmuls out of the loop there, so there must be some other complication?

craig.topper added inline comments.Feb 13 2023, 3:34 PM
llvm/lib/Transforms/InstCombine/InstCombineMulDivRem.cpp
531

MachineLICM doesn't fix it on pure scalar.

https://godbolt.org/z/zGvxxqaTP