This is an archive of the discontinued LLVM Phabricator instance.

[LTO] Ensure LICM hoists expensive fdiv instructions introduced by InstCombine
ClosedPublic

Authored by david-arm on Feb 9 2023, 1:12 AM.

Details

Summary

In the LTO pipeline we run InstCombine after LICM, which is
different to what we normally do without LTO. This has the
effect of undoing all the great work done by LICM to reduce
the cost of the loop when it hoists the fdiv out and replaces
it with fmul. When InstCombine runs after LICM it puts the
fdiv straight back which, on AArch64 at least, is darn
expensive. You can observe this problem in the SPEC2017
benchmark parest if you build with "-Ofast -flto" and the
loop-vectoriser uses an unroll factor of 1, which is what
often happens when tail-folding is enabled.

This is also a problem for scalar loops, or indeed any loop
where there is only one use of the preheader fdiv result in
the loop.

See InstCombinerImpl::visitFMul for the code that sinks the fdiv.

I've attempted to fix this by adding another LICM pass for Full
LTO after InstCombine. The alternative is to stop InstCombine
from sinking the fdiv into loops. See D87479 for a previous
discussion on this issue.

Diff Detail

Event Timeline

david-arm created this revision.Feb 9 2023, 1:12 AM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 9 2023, 1:12 AM
david-arm requested review of this revision.Feb 9 2023, 1:12 AM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 9 2023, 1:12 AM
david-arm edited the summary of this revision. (Show Details)Feb 9 2023, 1:21 AM

I've added another variation I encountered to D87479 which another run of LICM will not fix.

I've added another variation I encountered to D87479 which another run of LICM will not fix.

Would that case be fixed by disabling sinking in InstCombine? Maybe it's finally time to just drop sinking in InstCombine and deal with any fallout.

At least as a first check, compile-time sees a net improvement:
https://llvm-compile-time-tracker.com/compare.php?from=d87468e56c221555be7a1c1550036af5b7034896&to=a5691dd50d4098dec45571fcdfdc5403826a07de&stat=instructions:u

That change should also allow us to close a pile of longstanding InstCombine bugs caused by a conflict between iterative sinking and InstCombine's infinite-loop-threshold.

I get ~60 regression test failures with that patch, but regenerating the first 5 or so tests that I tried to update does not show any real regressions.

I've added another variation I encountered to D87479 which another run of LICM will not fix.

Would that case be fixed by disabling sinking in InstCombine? Maybe it's finally time to just drop sinking in InstCombine and deal with any fallout.

At least as a first check, compile-time sees a net improvement:
https://llvm-compile-time-tracker.com/compare.php?from=d87468e56c221555be7a1c1550036af5b7034896&to=a5691dd50d4098dec45571fcdfdc5403826a07de&stat=instructions:u

That change should also allow us to close a pile of longstanding InstCombine bugs caused by a conflict between iterative sinking and InstCombine's infinite-loop-threshold.

I get ~60 regression test failures with that patch, but regenerating the first 5 or so tests that I tried to update does not show any real regressions.

I don't think my new variation as anything to do with the sinking code in InstCombine. It's being caused by the same transform D87479 touched. That code creates an fmul followed by fdiv at the location of the original fmul. Even if the original fdiv was not in the same basic block or loop as the fmul. LICM will hoist a reciprocal out later, but that leaves an extra fmul.

I've added another variation I encountered to D87479 which another run of LICM will not fix.

Would that case be fixed by disabling sinking in InstCombine? Maybe it's finally time to just drop sinking in InstCombine and deal with any fallout.

At least as a first check, compile-time sees a net improvement:
https://llvm-compile-time-tracker.com/compare.php?from=d87468e56c221555be7a1c1550036af5b7034896&to=a5691dd50d4098dec45571fcdfdc5403826a07de&stat=instructions:u

That change should also allow us to close a pile of longstanding InstCombine bugs caused by a conflict between iterative sinking and InstCombine's infinite-loop-threshold.

I get ~60 regression test failures with that patch, but regenerating the first 5 or so tests that I tried to update does not show any real regressions.

I don't think my new variation as anything to do with the sinking code in InstCombine. It's being caused by the same transform D87479 touched. That code creates an fmul followed by fdiv at the location of the original fmul. Even if the original fdiv was not in the same basic block or loop as the fmul. LICM will hoist a reciprocal out later, but that leaves an extra fmul.

Ah - I see. I'll comment on the other patch then, so I don't clutter up this review about general sinking in InstCombine.

Hi @spatel @craig.topper, I did think about trying to change InstCombine to avoid sinking, but given resistance to this approach in the past I wasn't sure if this was the right approach. However, certainly it's unfortunate that we run InstCombine so many times in the LTO pipeline that we end up with this problem. I agree that adding another LICM pass at the end increases compilation time too, which isn't great. I think ultimately we have to decide between 1) increased compilation time without changing InstCombine, 2) reduce compilation time and increase performance by not sinking the fdiv in InstCombine. I'm not particularly attached to either approach, but I definitely do want to fix it. :)

Depending on the LICM pass being run after InstCombine feels a bit like a game of cat and mouse, where the former hoists out the reciprocal and the latter restores things to the canonical form. I worry that guarding this ordering with just a test is a bit fragile.

There also isn't much precedent for changing InstCombine to not do the simplification, as I only spotted one cases in InstCombine (InstCombinerImpl::visitGEPOfGEP) where it looked at LoopInfo to see if something is loop invariant. Additionally, if having 1/x outside the loop is the canonical form, it still relies on LICM being run first and for LoopInfo to be available in InstCombine (which in my understanding is optional, depending on the passes run before it).

So should this functionality perhaps be moved out of LICM into a more limited pass that is run just before SelectionDAG, or perhaps be made part of CodeGenPrepare?

Depending on the LICM pass being run after InstCombine feels a bit like a game of cat and mouse, where the former hoists out the reciprocal and the latter restores things to the canonical form. I worry that guarding this ordering with just a test is a bit fragile.

Agree.

There also isn't much precedent for changing InstCombine to not do the simplification, as I only spotted one cases in InstCombine (InstCombinerImpl::visitGEPOfGEP) where it looked at LoopInfo to see if something is loop invariant. Additionally, if having 1/x outside the loop is the canonical form, it still relies on LICM being run first and for LoopInfo to be available in InstCombine (which in my understanding is optional, depending on the passes run before it).

Agree again. But we do make some concessions about canonical form when we know an instruction really is going to be expensive on all targets (any type of div/rem).

So should this functionality perhaps be moved out of LICM into a more limited pass that is run just before SelectionDAG, or perhaps be made part of CodeGenPrepare?

CGP is an option, but that's also potentially not reliable (supposed to eventually go away with the switch to GISel?). A refinement of D87479 doesn't seem that bad to me after all. It's a hack, but there's no ideal fix in sight.

Depending on the LICM pass being run after InstCombine feels a bit like a game of cat and mouse, where the former hoists out the reciprocal and the latter restores things to the canonical form. I worry that guarding this ordering with just a test is a bit fragile.

Agree.

There also isn't much precedent for changing InstCombine to not do the simplification, as I only spotted one cases in InstCombine (InstCombinerImpl::visitGEPOfGEP) where it looked at LoopInfo to see if something is loop invariant. Additionally, if having 1/x outside the loop is the canonical form, it still relies on LICM being run first and for LoopInfo to be available in InstCombine (which in my understanding is optional, depending on the passes run before it).

Agree again. But we do make some concessions about canonical form when we know an instruction really is going to be expensive on all targets (any type of div/rem).

So should this functionality perhaps be moved out of LICM into a more limited pass that is run just before SelectionDAG, or perhaps be made part of CodeGenPrepare?

CGP is an option, but that's also potentially not reliable (supposed to eventually go away with the switch to GISel?). A refinement of D87479 doesn't seem that bad to me after all. It's a hack, but there's no ideal fix in sight.

Do you have a proposed refinement for D87479? Turns out LI is only used by InstCombine if its available already in the pipeline so I don't know if it's safe to use that instead of the basic block check.

Do you have a proposed refinement for D87479? Turns out LI is only used by InstCombine if its available already in the pipeline so I don't know if it's safe to use that instead of the basic block check.

Yes, using LoopInfo was what I was thinking. I just made a rough draft of that, and it handles the motivating case here, and I think it would work on the other examples we've raised too. I can clean it up, test some more, and post it.

Do you have a proposed refinement for D87479? Turns out LI is only used by InstCombine if its available already in the pipeline so I don't know if it's safe to use that instead of the basic block check.

Yes, using LoopInfo was what I was thinking. I just made a rough draft of that, and it handles the motivating case here, and I think it would work on the other examples we've raised too. I can clean it up, test some more, and post it.

I don't think it it will work on the case I raised since IR LICM can't fix it so the damage will happen the first time InstCombine runs without LoopInfo. But I hope I'm wrong.

Do you have a proposed refinement for D87479? Turns out LI is only used by InstCombine if its available already in the pipeline so I don't know if it's safe to use that instead of the basic block check.

Yes, using LoopInfo was what I was thinking. I just made a rough draft of that, and it handles the motivating case here, and I think it would work on the other examples we've raised too. I can clean it up, test some more, and post it.

I don't think it it will work on the case I raised since IR LICM can't fix it so the damage will happen the first time InstCombine runs without LoopInfo. But I hope I'm wrong.

You're correct - it seems that LICM isn't able to do this in general:

loop:
  m1 = loopInvariant1 * x;
  m2 = loopInvariant2 * m1;
-->
loopInvariantMul = loopInvariant1 * loopInvariant2;
loop:
  m = x * loopInvariantMul;

I don't know much about LICM - should it be able to do that transform? Or is some other pass responsible for the math reassociation?
This doesn't seem to be fdiv-specific. Here's an example with fadd/fmul; there should only be 1 invariant fadd in the loop:
https://godbolt.org/z/xrs9xfGrf

david-arm updated this revision to Diff 535766.Jun 29 2023, 6:49 AM
david-arm edited the summary of this revision. (Show Details)
  • Rebase
david-arm edited reviewers, added: paulwalker-arm, nikic; removed: spatel.Jun 29 2023, 6:54 AM
  • Gentle ping! After discussions on an alternative approach (D144045), I think this solution is the correct way forward. It would be great if we could fix this issue for LLVM 17!
nikic added inline comments.Jun 29 2023, 6:59 AM
llvm/lib/Passes/PassBuilderPipelines.cpp
1247–1248

It would be better to move these InstCombine + LICM passes out of the if and drop the InstCombine below.

david-arm updated this revision to Diff 536791.Jul 3 2023, 9:11 AM
david-arm added a subscriber: hassnaa-arm.
  • Restructured the passes according to review comments.
david-arm marked an inline comment as done.Jul 3 2023, 9:11 AM
paulwalker-arm accepted this revision.Jul 3 2023, 12:31 PM

Looks good but the title could do with rewording because the patch doesn't actually change instcombine.

This revision is now accepted and ready to land.Jul 3 2023, 12:31 PM

Compile-time: http://llvm-compile-time-tracker.com/compare.php?from=6c5ba7cddfa2210e9ddb12ab6f016b84db9a8b23&to=ef44d3f482b284ee11a5335d8c714f7eceb3dea7&stat=instructions:u

Post-link numbers are:

kc.link 	 	30147M 	30326M (+0.59%)
sqlite3.link 	 	40914M 	41303M (+0.95%)
consumer-typeset.link 	36935M 	37405M (+1.27%)
bullet.link 	 	27582M 	27914M (+1.20%)
tramp3d-v4.link 	137504M 139271M (+1.29%)
pairlocalalign.link 	19476M 	19686M (+1.08%)
clamscan.link 	 	56907M 	57439M (+0.93%)
lencod.link 	 	102321M 102803M (+0.47%)
SPASS.link 	 	51517M 	52669M (+2.24%)
7zip-benchmark.link 	106003M 105851M (-0.14%)

Despite that, I'm on board with this change.

llvm/lib/Passes/PassBuilderPipelines.cpp
1261

UseBlockFrequencyInfo should stay false.

david-arm updated this revision to Diff 537382.Jul 5 2023, 9:08 AM
david-arm retitled this revision from [LTO] Don't let InstCombine re-sink the vastly more expensive fdiv to [LTO] Ensure LICM hoists expensive fdiv instructions introduced by InstCombine.
david-arm marked an inline comment as done.

Looks good but the title could do with rewording because the patch doesn't actually change instcombine.

Hopefully the new title is better now!

This revision was landed with ongoing or failed builds.Jul 7 2023, 4:06 AM
This revision was automatically updated to reflect the committed changes.