This is an archive of the discontinued LLVM Phabricator instance.

Use frequency info to guide Loop Invariant Code Motion.
AbandonedPublic

Authored by danielcdh on May 4 2016, 6:24 PM.

Details

Summary

If an instruction's frequency is smaller than preheader, it will harm performance to hoist the instruction to the preheader.

Diff Detail

Event Timeline

danielcdh updated this revision to Diff 56229.May 4 2016, 6:24 PM
danielcdh retitled this revision from to Use frequency info to guide Loop Invariant Code Motion..
danielcdh updated this object.
danielcdh added reviewers: hfinkel, davidxl, majnemer.
danielcdh added a subscriber: llvm-commits.

Do you have any performance data to back your claim?

lib/Transforms/Scalar/LICM.cpp
347

This could be simplified to:

bool ShouldHoist = BFI->getBlockFreq(BB) >= BFI->getBlockFreq(CurLoop->getLoopPreheader();

and should probably be places inside the if (!inSubLoop(BB, CurLoop, LI)) scope. I.e.,

if (!inSubLoop(BB, CurLoop, LI)) {
  bool ShouldHoist = BFI->getBlockFreq(BB) >= BFI->getBlockFreq(CurLoop->getLoopPreheader();
  for (...)
    ...
}
test/Other/pass-pipelines.ll
40

I may be missing something here, but is this test even relevant to this patch?

mcrosier added inline comments.May 5 2016, 5:58 AM
test/Other/pass-pipelines.ll
40

I guess this just shows the BPA/BFA is being added to the pipeline? We also need a test in test/Transforms/LICM..

junbuml added a subscriber: junbuml.May 5 2016, 6:58 AM
davidxl edited edge metadata.May 5 2016, 8:42 AM

A test case should be added to demonstrate the issue.

test/Other/pass-pipelines.ll
40

BPI is required by BFI

danielcdh updated this revision to Diff 56324.May 5 2016, 12:08 PM
danielcdh edited edge metadata.

Add a unittest to cover the added logic.

The test case can be further simplified .. (e.g, by disabing licm with O2 -emit-llvm -S). Also make x a local variable and just test 'g', let function return x.

danielcdh updated this revision to Diff 56465.May 6 2016, 2:34 PM

reduce the unittest.

davidxl added inline comments.May 6 2016, 9:15 PM
lib/Transforms/Scalar/LICM.cpp
353–355

Better to do this:

if (ShouldHoist && !inSubLoop(BB, CurLoop, LI))

Chad's suggestion of doing

ShouldHoist = BFI->getBlockFreq(BB) >= .. is also good.

danielcdh added inline comments.May 9 2016, 10:12 AM
lib/Transforms/Scalar/LICM.cpp
353–355

Updated with Chad's suggestion. But I cannot move the ShouldHoist check outside the loop because there is still constant folding opportunities inside the loop.

danielcdh updated this revision to Diff 56593.May 9 2016, 10:12 AM

Move ShouldHoist inside the branch.

junbuml added inline comments.May 9 2016, 10:16 AM
include/llvm/Transforms/Utils/LoopUtils.h
369–377

Please update the comment.

lib/Transforms/Scalar/LICM.cpp
353–356

I agree that we still need to perform constant folding even when ShouldHoist is false?

danielcdh updated this revision to Diff 56594.May 9 2016, 10:24 AM

update comment

danielcdh marked an inline comment as done.May 9 2016, 10:25 AM
danielcdh added inline comments.
lib/Transforms/Scalar/LICM.cpp
353–356

yes

My concern is that static prediction can certainly get it wrong leading to performance regression. I suggest only check this when real profile data is available.

hfinkel edited edge metadata.May 9 2016, 4:24 PM

My concern is that static prediction can certainly get it wrong leading to performance regression. I suggest only check this when real profile data is available.

Why is this any more of a concern here vs. anywhere else we use static heuristics for block frequencies? I'd think that any problems in this regard should be corrected by making our static heuristics more conservative (i.e. weaker), no?

Static prediction has been conservative in estimating loop trip count -- it produces something like 30ish iterations. If the a very hot loop has a big if-then-else (or switch), it is very likely to mark many bbs' to be colder than the loop header. Turning on this for static prediction really depends on the false rate. It seems to be this can get wrong pretty easily for very hot loops (which is also the most important thing to optimize for).

Static prediction has been conservative in estimating loop trip count -- it produces something like 30ish iterations. If the a very hot loop has a big if-then-else (or switch), it is very likely to mark many bbs' to be colder than the loop header. Turning on this for static prediction really depends on the false rate. It seems to be this can get wrong pretty easily for very hot loops (which is also the most important thing to optimize for).

This is a good point. There's no universal conservative choice (assuming a small trip count is conservative in some cases, and assuming a large trip count is conservative in other cases).

Static prediction has been conservative in estimating loop trip count -- it produces something like 30ish iterations. If the a very hot loop has a big if-then-else (or switch), it is very likely to mark many bbs' to be colder than the loop header. Turning on this for static prediction really depends on the false rate. It seems to be this can get wrong pretty easily for very hot loops (which is also the most important thing to optimize for).

This is a good point. There's no universal conservative choice (assuming a small trip count is conservative in some cases, and assuming a large trip count is conservative in other cases).

Would it be better (and practical) if there were some way for the BFI client to specify which kind of 'conservative' is desired?

Also, why are we doing this instead of sinking later (in CGP or similar)? LICM can expose optimization opportunities, plus represents a code pattern the user might input manually. Sinking later seems more robust.

Static prediction has been conservative in estimating loop trip count -- it produces something like 30ish iterations. If the a very hot loop has a big if-then-else (or switch), it is very likely to mark many bbs' to be colder than the loop header. Turning on this for static prediction really depends on the false rate. It seems to be this can get wrong pretty easily for very hot loops (which is also the most important thing to optimize for).

This is a good point. There's no universal conservative choice (assuming a small trip count is conservative in some cases, and assuming a large trip count is conservative in other cases).

Would it be better (and practical) if there were some way for the BFI client to specify which kind of 'conservative' is desired?

Also, why are we doing this instead of sinking later (in CGP or similar)? LICM can expose optimization opportunities, plus represents a code pattern the user might input manually. Sinking later seems more robust.

I looked at CGP pass, looks like it's handling the sinking case-by-case (e.g. there is separate routine to handle sinking of load, gep, etc. I'm afraid this would miss opportunities. Additionally, the file-level comment of CGP pass says "This works around limitations in it's basic-block-at-a-time approach. It should eventually be removed."

I'm not quite clear why it helps to move code out of loop early and later sink it inside. Could you give an example or some more context?

Thanks,
Dehao