
[BPI] Improve static heuristics for "cold" paths.
Needs Review · Public

Authored by ebrevnov on May 6 2020, 5:36 AM.



The current approach doesn't work well when multiple paths are predicted to be "cold". By "cold" paths I mean those containing an "unreachable" instruction, a call marked with the 'cold' attribute, or the 'unwind' handler of an 'invoke' instruction. The issue is that the heuristics are applied one by one until the first match, and they essentially ignore the relative hotness/coldness of other paths.

The new approach unifies processing of "cold" paths by assigning a predefined absolute weight to each block estimated to be "cold". We then propagate these weights up/down the IR similarly to the existing approach, and finally set edge probabilities based on the estimated block weights.

One important difference is how we propagate weight upwards. The existing approach propagates the same weight to all blocks that are post-dominated by a block with some "known" weight. This is useless, at least because it always gives a 50/50 distribution, which is assumed by default anyway. Worse, it causes the algorithm to skip further heuristics and can miss setting a more accurate probability. The new algorithm propagates the weight up only to blocks that dominate and are post-dominated by a block with some "known" weight. In other words, those blocks that are always executed, or not executed, together with that block.

In addition, the new approach processes loops in a uniform way as well. Essentially, loop exit edges are estimated as "cold" paths relative to back edges and should be considered uniformly with other coldness/hotness markers.

Diff Detail

Unit Tests: Failed

60 ms · windows > LLVM.DebugInfo/X86::addr-tu-to-non-tu.ll
Script: -- : 'RUN: at line 1'; c:\ws\w16n2-1\llvm-project\premerge-checks\build\bin\llc.exe -filetype=obj -O0 -generate-type-units -split-dwarf-file=x.dwo < C:\ws\w16n2-1\llvm-project\premerge-checks\llvm\test\DebugInfo\X86\addr-tu-to-non-tu.ll | c:\ws\w16n2-1\llvm-project\premerge-checks\build\bin\llvm-dwarfdump.exe -debug-info -debug-types - | c:\ws\w16n2-1\llvm-project\premerge-checks\build\bin\filecheck.exe --implicit-check-not=Unit --implicit-check-not=contents --implicit-check-not=declaration C:\ws\w16n2-1\llvm-project\premerge-checks\llvm\test\DebugInfo\X86\addr-tu-to-non-tu.ll

Event Timeline

ebrevnov added a comment (edited). Jul 24 2020, 5:01 AM

I have moved part of the change into a separate review D84514. Let's start with that...

Marked as WIP since I'm going to extract one more part out from this review.

ebrevnov retitled this revision from [BPI] Improve static heuristics for "cold" paths. to [WIP][BPI] Improve static heuristics for "cold" paths..Jul 28 2020, 6:10 AM
ebrevnov updated this revision to Diff 282186.Jul 31 2020, 4:11 AM

Rebased on top of D84838

ebrevnov added inline comments.Jul 31 2020, 4:27 AM

The short answer is: to preserve the same branch probability as before.
The old implementation assigned the CC_NONTAKEN_WEIGHT weight to the "normal" path and CC_TAKEN_WEIGHT to the "cold" path. The absolute values are irrelevant; the ratio CC_NONTAKEN_WEIGHT/CC_TAKEN_WEIGHT = 16 defines the relative probability of the "normal" and "cold" branches. The new implementation uses a pre-selected weight, DEFAULT_WEIGHT, for all "normal" paths. Thus the relative ratio is calculated as DEFAULT_WEIGHT/COLD_WEIGHT = 0xfffff/0xffff = 16.

ebrevnov retitled this revision from [WIP][BPI] Improve static heuristics for "cold" paths. to [BPI] Improve static heuristics for "cold" paths..Jul 31 2020, 4:27 AM

It's a bit shorter now. Let's see if it's manageable. Further reduction would be problematic.

A lot of test changes can probably be extracted out as NFC (to make the string test more robust) -- this will reduce the size of the patch.

Do you have performance numbers related to the change?


Why not define a value for DEFAULT_WEIGHT, and define COLD_WEIGHT to be 1/16 of the DEFAULT weight? It is more readable that way.


document the method and params.


explain here. Why does the early estimated weight take precedence?


why do an RPO walk? the algorithm does not seem to depend on this order.


why is the new estimation better?

A lot of test changes can probably be extracted out as NFC (to make the string test more robust) -- this will reduce the size of the patch.

Unfortunately, in most cases I don't see how this can be done without losing current coverage. There are a few cases like 'successors: %bb.3(0x80000000), %bb.4(0x00000000)' where we could remove the exact numbers and just check block names. I think we'd better keep exact probabilities in all other cases; that actually helps to catch unintended changes.

Do you have performance numbers related to the change?

I measured performance on a bunch of Java-related benchmarks we have in house, including SPECjvm2008, SPECjbb2015, DaCapo 9.12 and others. I don't see any noticeable impact on performance.


This is just the way the weights were defined before (using direct numbers instead of a ratio). Please note that all the other cases (like ZH_TAKEN_WEIGHT, ZH_NONTAKEN_WEIGHT) don't use a ratio either. Another, theoretical, reason could be the ability to represent ratios with a non-zero remainder.




will add a comment.


You are right: it should not affect correctness, but it could cause the algorithm to do more iterations. The 'while' loop following this one propagates weights from successors to predecessors. If any of the successors is not yet evaluated, it will trigger an additional iteration once it does get evaluated. That means it's preferable to evaluate successors before predecessors.


This is a tough question. Let's take the extreme case where a loop has an infinite number of exits. In that case the probability of taking the back branch should go to zero and the loop frequency should be 1.0. That means the more exits we have, the less probable it is to go to the next iteration.

The existing algorithm completely ignores the number of exits and always assumes the loop makes 32 iterations. The new algorithm assumes that the back-branch weight is 32 times higher than the default exiting weight. Thus, in the case of one loop exit, both algorithms give the following probability for the back branch:

edge loop -> loop probability is 0x7c000000 / 0x80000000 = 96.88% [HOT edge]

But because in this case there are two exiting edges, the probability of the back branch is calculated as 32*DW/(32*DW+2*DW) = 32/34 = 0.94:

edge loop -> loop probability is 0x783e0f84 / 0x80000000 = 93.94%

I think this is a pretty reasonable probability that much better matches the general high-level picture.

There is one caveat though. Please note that in this example we have two exiting branches from the latch. If we had a side exit from another block, even the new algorithm wouldn't take it into account. I think we should put a TODO on that and follow up later. What do you think?

ebrevnov added inline comments.Aug 11 2020, 11:46 PM


In fact, there is documentation in the header file. In the source file I can put more details on the behavior in case of a rewrite attempt, as you requested below.

ebrevnov updated this revision to Diff 284984.Aug 12 2020, 12:04 AM

Added a comment to updateEstimatedBlockWeight

wenlei added a subscriber: hoyFB.Aug 12 2020, 9:58 PM
wenlei added a subscriber: wenlei.

Is this work based on any paper/implementation?
Can we add some documentation at the top of the file to get an overall idea of the cost model?


nit: inconsistent comments '//' vs '///'


do we have a tab here?

This comment was removed by xbolva00.

Is this work based on any paper/implementation?

I don't propose anything completely new here. The new algorithm does essentially the same thing as the existing one, but it

  1. fixes several correctness issues
  2. extends the approach to handle loops and invokes in a universal way

Can we add some documentation at the top of the file to get an overall idea of the cost model?

I don't think "cost model" is applicable here... anyway, I can try to describe how things work using some example. Is this what you are looking for?


Will remove extra /


No, I don't have tabs in the code. It looks like Phabricator represents an indentation increase of 4 this way.

...anyway, I can try to describe how things work using some example. Is this what you are looking for?

yes. Thank you.

please run clang-format.

ebrevnov updated this revision to Diff 286234.Aug 18 2020, 3:03 AM

Added description of the algorithm to the header + formatting.



Is the test failure related to this patch?

Is the test failure related to this patch?

I doubt it is. But I will double-check before committing.

ebrevnov edited the summary of this revision. (Show Details)Sep 16 2020, 10:19 PM

I'm still reading. The patch is long, but I'd like to avoid splitting it into pieces, since that might result in massive test changes (back and forth). For now the test changes seem to be compact.
It would be great if someone could join the reviewing effort.


agree, the simple in-place definition would simplify the reading.




... then sets ... and returns true.


change: ... all blocks/loops potentially affected by ...
to: ... all blocks/loops that might need their weight re-estimated after ...
or something similar in meaning.


... propagates ...


... sets ...


maybe rename to LOWEST_NON_ZERO_WEIGHT or ULP_WEIGHT to explicitly denote its meaning?



  1. group the new related *_WEIGHT constants into one enum (to distinguish them from the other *_WEIGHT constants)
  2. explicitly describe their values (e.g. why COLD_WEIGHT is X times as high as LOWEST_NONZERO_WEIGHT)
  3. avoid using hex notation unless it makes sense (e.g. important for bitwise operations)

Something like the following:

enum {

I suggest that we introduce UNKNOWN_WEIGHT for most of the DEFAULT_WEIGHT uses, to better reflect its meaning and the way it is treated.
DEFAULT_WEIGHT should be used only in place of UNKNOWN_WEIGHT at the very last step of the BPI calculation (after propagation).


could be defined along with its declaration for ease of reading


could be defined along with its declaration for ease of reading


could be defined along with its declaration for ease of reading


I would remove this doc comment from the definition (so it doesn't deviate from the doc at the declaration) and put the note inside the body.


early return would be easier to read and would not need std::tie and std::ignore:

if (!EstimatedBlockWeight.insert({BB, BBWeight}).second)
  return false;
return true;

the idiom:

for (BasicBlock *P : predecessors(BB))

please explain (what happens if the previous weight is not equal to the new one? maybe assert(PrevWeight <= BBWeight)?)


change TC to trip count


rename WorkList to BlockWorkList as it is in computeEstimatedBlockWeight().


could it result in deep recursion? maybe a worklist would be better?






could be renamed to estimateBlockWeights() (plural)


'is never returns' sounds strange. Not sure, but maybe IsNeverReturn? IsNeverReturning? IsDeadEnd?
The function is used only once and only with blocks of two kinds:

  1. blocks that have their terminator instruction preceded by a deoptimize call. So the terminator must be the ret instruction.
  2. blocks with unreachable terminator instruction.

Both cases imply no block successors. So it is better to have
assert(BB->getTerminator()->getNumSuccessors() == 0)
Otherwise (for a generic function) there could be a question: why does this condition have a lower priority than a call with Attribute::NoReturn?

call @foo() noreturn
br label %next

if we extracted a lambda estimateBlockWeight (which is worth making a separate member function with its own description) then the structure would be concise:

auto estimateBlockWeight = [&](const BasicBlock *BB) -> Optional<uint32_t> {
  if (isa<UnreachableInst>(BB->getTerminator()) || BB->getTerminatingDeoptimizeCall())

  for (const auto *Pred : predecessors(BB))
    if (const auto *II = dyn_cast<InvokeInst>(Pred->getTerminator()))
      if (II->getUnwindDest() == BB)
        return UNWIND_WEIGHT;

  for (const auto &I : *BB)
    if (const CallInst *CI = dyn_cast<CallInst>(&I))
      if (CI->hasFnAttr(Attribute::Cold))
        return COLD_WEIGHT;

  return None;

ReversePostOrderTraversal<const Function *> RPOT(&F);
for (const auto *BB : RPOT)
  if (auto BBWeight = estimateBlockWeight(BB))
    propagateEstimatedBlockWeight(getLoopBlock(BB), DT, PDT,
                                  BBWeight.getValue(), BlockWorkList,

not needed?


remove braces { .. }

Please fix clang-format issues.