If an instruction's frequency is smaller than preheader, it will harm performance to hoist the instruction to the preheader.
Diff Detail
Event Timeline
Do you have any performance data to back your claim?
lib/Transforms/Scalar/LICM.cpp | ||
---|---|---|
347 | This could be simplified to: bool ShouldHoist = BFI->getBlockFreq(BB) >= BFI->getBlockFreq(CurLoop->getLoopPreheader(); and should probably be places inside the if (!inSubLoop(BB, CurLoop, LI)) scope. I.e., if (!inSubLoop(BB, CurLoop, LI)) { bool ShouldHoist = BFI->getBlockFreq(BB) >= BFI->getBlockFreq(CurLoop->getLoopPreheader(); for (...) ... } | |
test/Other/pass-pipelines.ll | ||
40 | I may be missing something here, but is this test even relevant to this patch? |
test/Other/pass-pipelines.ll | ||
---|---|---|
40 | I guess this just shows the BPA/BFA is being added to the pipeline? We also need a test in test/Transforms/LICM.. |
A test case should be added to demonstrate the issue.
test/Other/pass-pipelines.ll | ||
---|---|---|
40 | BPI is required by BFI |
The test case can be further simplified .. (e.g, by disabing licm with O2 -emit-llvm -S). Also make x a local variable and just test 'g', let function return x.
lib/Transforms/Scalar/LICM.cpp | ||
---|---|---|
353–355 | Better to do this: if (ShouldHoist && !inSubLoop(BB, CurLoop, LI)) Chad's suggestion of doing ShouldHoist = BFI->getBlockFreq(BB) >= .. is also good. |
lib/Transforms/Scalar/LICM.cpp | ||
---|---|---|
353–355 | Updated with Chad's suggestion. But I cannot move the ShouldHoist check outside the loop because there is still constant folding opportunities inside the loop. |
lib/Transforms/Scalar/LICM.cpp | ||
---|---|---|
353–356 | yes |
My concern is that static prediction can certainly get it wrong leading to performance regression. I suggest only check this when real profile data is available.
Why is this any more of a concern here vs. anywhere else we use static heuristics for block frequencies? I'd think that any problems in this regard should be corrected by making our static heuristics more conservative (i.e. weaker), no?
Static prediction has been conservative in estimating loop trip count -- it produces something like 30ish iterations. If the a very hot loop has a big if-then-else (or switch), it is very likely to mark many bbs' to be colder than the loop header. Turning on this for static prediction really depends on the false rate. It seems to be this can get wrong pretty easily for very hot loops (which is also the most important thing to optimize for).
This is a good point. There's no universal conservative choice (assuming a small trip count is conservative in some cases, and assuming a large trip count is conservative in other cases).
Would it be better (and practical) if there were some way for the BFI client to specify which kind of 'conservative' is desired?
Also, why are we doing this instead of sinking later (in CGP or similar)? LICM can expose optimization opportunities, plus represents a code pattern the user might input manually. Sinking later seems more robust.
I looked at CGP pass, looks like it's handling the sinking case-by-case (e.g. there is separate routine to handle sinking of load, gep, etc. I'm afraid this would miss opportunities. Additionally, the file-level comment of CGP pass says "This works around limitations in it's basic-block-at-a-time approach. It should eventually be removed."
I'm not quite clear why it helps to move code out of loop early and later sink it inside. Could you give an example or some more context?
Thanks,
Dehao
Please update the comment.