This is an archive of the discontinued LLVM Phabricator instance.

[LICM] Make Loop ICM profile aware again
ClosedPublic

Authored by modimo on Sep 11 2020, 3:33 PM.

Details

Summary

D65060 was reverted because it introduced non-determinism by using BFI counts from already freed blocks. The parent of this revision fixes that by using a VH callback on blocks to prevent this from happening and makes sure BFI data is passed correctly in LoopStandardAnalysisResults.

This re-introduces the previous optimization of using BFI data to prevent LICM from hoisting/sinking if the instruction will end up moving to a colder block.

Internally at Facebook this change results in a ~7% win in a CPU related metric in one of our big services by preventing hoisting cold code into a hot pre-header like the added test case demonstrates.

Testing:
ninja check

Diff Detail

Event Timeline

modimo created this revision.Sep 11 2020, 3:33 PM
Herald added a project: Restricted Project. · View Herald TranscriptSep 11 2020, 3:33 PM
modimo requested review of this revision.Sep 11 2020, 3:33 PM
asbirlea accepted this revision.Sep 14 2020, 2:24 PM

Thank you for the patch!

Could you include in the description the benefits seen from adding this, as a data point?

llvm/include/llvm/Transforms/Utils/LoopUtils.h
127

Nit: update comment to include arguments and not duplicate "information for all".

This revision is now accepted and ready to land.Sep 14 2020, 2:24 PM
modimo updated this revision to Diff 291692.Sep 14 2020, 3:11 PM
modimo edited the summary of this revision. (Show Details)

Fix merge issues with comment, updated description to include benefit we see from this change. Thanks @asbirlea for the quick review!

modimo updated this revision to Diff 291702.Sep 14 2020, 3:15 PM
modimo edited the summary of this revision. (Show Details)

Remove unnecessary period

This revision was automatically updated to reflect the committed changes.

Committed on behalf of @modimo.

reames added a subscriber: reames.Oct 14 2021, 2:14 PM

I'm sorry for coming to this very late, but I just noticed this patch due to discussion on another patch (D111806).

This change is inconsistent with the long standing design of LICM. LICM is a canonicalizing transform, and we specifically *do not* use any profitability metric therein. MachineLICM exists specifically to be the inverse transformation which is cost based. I see nothing in the example to make it clear why MachineLICM could not handle this case.

As evidenced by the very existence of the later patch, having this in tree sets a bad precedent. I believe this patch should be reverted.

Given it's been in tree for nearly a year, this clearly isn't immediately urgent. Given that, I'm going to give a couple of days for discussion before reverting this patch.

I'm sorry for coming to this very late, but I just noticed this patch due to discussion on another patch (D111806).

This change is inconsistent with the long standing design of LICM. LICM is a canonicalizing transform, and we specifically *do not* use any profitability metric therein. MachineLICM exists specifically to be the inverse transformation which is cost based. I see nothing in the example to make it clear why MachineLICM could not handle this case.

As evidenced by the very existence of the later patch, having this in tree sets a bad precedent. I believe this patch should be reverted.

Given it's been in tree for nearly a year, this clearly isn't immediately urgent. Given that, I'm going to give a couple of days for discussion before reverting this patch.

FWIW, i agree with Philip on this.

While i'm sure that this, and many other, transform are easy wins,
they fundamentally do not fit the design of the pass(es). Creating initial divergence
opens a door for further changes of the same nature, and it's increasingly more
and more complicated to keep the door shut further down the line. The other example
is e.g. instcombine. We just can not afford to have a single canonical representation
except not really when $X || $Y || $z.

This must be rolled back.

reames added a comment.Dec 3 2021, 5:11 PM

I will be reverting this change in the near future. See previously expressed comments on this review, and the recent llvm-dev thread titled "LICM as canonical form".

Note that I confirmed that both test cases checked in with this change are handled by LoopSink.

wenlei added a comment.Jun 8 2023, 8:05 PM

This revert falls through the cracks, but recently we saw a 3% perf regression on a very large server workload due to compiler upgrade, and we tracked it down to this revert. It triggered investigation again, and I read through the comments on patches and "LICM as canonical form" thread - thanks for following up and the clarifications @reames.

It is perhaps an extreme case that we're looking at - we have a 400+ entry cold jump table (from switch) inside a loop, and hoisting defs from each of the 400+ jump targets all to loop pre-header killed the performance. The defs from the 400+ jump table targets all feed into a PHI-use afterwards, and that PHI is what blocked loop sink.

So it comes down to PHI uses. Loop sink currently blocks sinking to PHI-use: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Transforms/Scalar/LoopSink.cpp#L181-L182. I think we can see through PHI and sink to the incoming block of the value being used - I gave it a try and that seems to address our problem. Testing on a few internal components showed that removing the limitation on sinking to PHI sometimes allows 2x more loop sink to happen. I will send up a patch.

Herald added a project: Restricted Project. · View Herald TranscriptJun 8 2023, 8:05 PM