This is an archive of the discontinued LLVM Phabricator instance.

[BPI] Improve static heuristics for "cold" paths.
ClosedPublic

Authored by ebrevnov on May 6 2020, 5:36 AM.

Details

Summary

Current approach doesn't work well in cases when multiple paths are predicted to be "cold". By "cold" paths I mean those containing "unreachable" instruction, call marked with 'cold' attribute and 'unwind' handler of 'invoke' instruction. The issue is that heuristics are applied one by one until the first match and essentially ignores relative hotness/coldness
of other paths.

New approach unifies processing of "cold" paths by assigning predefined absolute weight to each block estimated to be "cold". Then we propagate these weights up/down IR similarly to existing approach. And finally set up edge probabilities based on estimated block weights.

One important difference is how we propagate weight up. Existing approach propagates the same weight to all blocks that are post-dominated by a block with some "known" weight. This is useless at least because it always gives 50\50 distribution which is assumed by default anyway. Worse, it causes the algorithm to skip further heuristics and can miss setting more accurate probability. New algorithm propagates the weight up only to the blocks that dominates and post-dominated by a block with some "known" weight. In other words, those blocks that are either always executed or not executed together.

In addition new approach processes loops in an uniform way as well. Essentially loop exit edges are estimated as "cold" paths relative to back edges and should be considered uniformly with other coldness/hotness markers.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
john.brawn added inline comments.Jun 3 2020, 10:57 AM
llvm/include/llvm/Analysis/BranchProbabilityInfo.h
337

What's TODO here (and similar for the TODOs below)?

368–412

Would it be possible to move all of these is/getSCC functions into SccInfo, to avoid cluttering up BranchProbabilityInfo?

370

Probably calculateSccBlockInfo would be better here.

llvm/lib/Analysis/BranchProbabilityInfo.cpp
853

Weight, not Weigth

867

This is making sure that the divide isn't rounding down to zero, yes? In which case I don't think it'll correctly handle the case where getEstimatedEdgeWeight returns ZERO_WEIGHT.

ebrevnov updated this revision to Diff 273316.Jun 25 2020, 5:48 AM
ebrevnov marked 6 inline comments as done.

Fixes according to comments + rebase

ebrevnov marked an inline comment as done.Jun 25 2020, 5:54 AM

Hi @john.brawn,

Sorry for keeping silence for that long. I had to work on higher priority tasks before I got a chance to return back to this. I think I fixed all your remarks. Please take a look.

Thanks
Evgeniy

ebrevnov updated this revision to Diff 273323.Jun 25 2020, 5:59 AM

Minor update

Can you provide some performace data (SPEC/llvm suite) ?

Can you provide some performace data (SPEC/llvm suite) ?

By SPEC I guess you mean SPEC CPU2006/2017, right? Unfortunately, I don't have access to this benchmark. Would be nice if somebody could check it for me. On my side I can check SPEC JVM2008 and some other java benchmarks.

Regarding LLVM suite. I tried running it about half year ago but got negative experience. The results were not consistent and showed big variation. I can give one more try though. Any advice on running it is highly appreciated.

ebrevnov updated this revision to Diff 273603.Jun 26 2020, 12:31 AM

Formatting

Can you provide some performace data (SPEC/llvm suite) ?

By SPEC I guess you mean SPEC CPU2006/2017, right? Unfortunately, I don't have access to this benchmark. Would be nice if somebody could check it for me. On my side I can check SPEC JVM2008 and some other java benchmarks.

Regarding LLVM suite. I tried running it about half year ago but got negative experience. The results were not consistent and showed big variation. I can give one more try though. Any advice on running it is highly appreciated.

Yeah, any benchmark data would be great.

I measured performance on bunch of java related benchmarks we have in house including SPECJVM2008, SPECJbb2015, DaCapo9.12 and others. I don't see any noticeable impact on performance. Please let me know if you need more details on these experiments or some additional perf data.

This patch is a little large to review properly. Is there any major refactoring (NFC or mostly NFC) that can be separated out?

llvm/lib/Analysis/BranchProbabilityInfo.cpp
120

why is it set to 0xffff?

This patch is a little large to review properly. Is there any major refactoring (NFC or mostly NFC) that can be separated out?

I think I should be able to extract one or two NFCs out of this. Let me do that.

The PGOUse pass can choose not to annotate any branches with total weights == 0. Now the question becomes how do we tell PGOUse pass whether the entry should be set to 0 or leave it not set. There are two ways to do it (to signal it is not really cold, but unknown):

  1. Remove the function from the indexed format profile;
  2. set all counts to some sentinel value such as -1.
ebrevnov added a comment.EditedJul 24 2020, 5:01 AM

I have moved part of the change into a separate review D84514. Let's start with that...

Marked as WIP since I'm going to extract one more part out from this review.

ebrevnov retitled this revision from [BPI] Improve static heuristics for "cold" paths. to [WIP][BPI] Improve static heuristics for "cold" paths..Jul 28 2020, 6:10 AM
ebrevnov updated this revision to Diff 282186.Jul 31 2020, 4:11 AM

Rebased on top of D84838

ebrevnov added inline comments.Jul 31 2020, 4:27 AM
llvm/lib/Analysis/BranchProbabilityInfo.cpp
120

Short answer is to preserve same branch probability as before.
Old implementation assigned CC_NONTAKEN_WEIGHT weight to "normal" path and CC_TAKEN_WEIGHT to "cold" path. Absolute values are irrelevant and ratio CC_NONTAKEN_WEIGHT/CC_TAKEN_WEIGHT = 16 defines relative probability of "normal" and "cold" branches. New implementation uses pre-selected weight for all "nornal" paths DEFAULT_WEIGHT. Thus relative ratio is calculated as DEFAULT_WEIGHT/COLD_WEIGHT = 0xfffff/0xffff = 16.

ebrevnov retitled this revision from [WIP][BPI] Improve static heuristics for "cold" paths. to [BPI] Improve static heuristics for "cold" paths..Jul 31 2020, 4:27 AM

It's a bit shorter now. Let's see if it's manageable now. Further reduction would be problematic.

A lot of test changes can probably be extracted out as NFC (to make the string test more robust) -- this will reduce the size of the patch.

Do you have performance numbers related to the change?

llvm/lib/Analysis/BranchProbabilityInfo.cpp
120

Why not define a value for DEFAULT_WEIGHT, and define COLD_WEIGHT to be 1/16 of the DEFAULT weight? It is more readable that way.

719

document the method and params.

727

explain here. Why the early estimated weight takes precedence?

757

why doing RPO walk? the algorithm does not seem to depend on this order.

llvm/test/Analysis/BlockFrequencyInfo/redundant_edges.ll
12

why is the new estimation better?

A lot of test changes can probably be extracted out as NFC (to make the string test more robust) -- this will reduce the size of the patch.

Unfortunately, in most cases, I don't see how this can be done with out loosing current coverage. There are few cases like 'successors: %bb.3(0x80000000), %bb.4(0x00000000)' where we could remove exact numbers and just check block names. I think we better keep exact probabilities in all other cases. That actually helps to catch unintended changes.

Do you have performance numbers related to the change?

I measured performance on bunch of java related benchmarks we have in house including SPECJVM2008, SPECJbb2015, DaCapo9.12 and others. I don't see any noticeable impact on performance.

llvm/lib/Analysis/BranchProbabilityInfo.cpp
120

This is just the way how weights were defined before (using direct numbers instead of ratio). Please note all other cases (like ZH_TAKEN_WEIGHT, ZH_NONTAKEN_WEIGHT) don't use ratio as well. Another theoretical reason could be an ability to represent ratios with non zero remainder.

719

sure

727

will add a comment.

757

You are right and it should not affect correctness but could cause the algorithm to do more iterations. 'While' loop following this one propagates weights from successors to predecessors. If any of the successors not yet evaluated it will trigger an additional iteration when get evaluated. That means it's preferred to evaluate successors before predecessors.

llvm/test/Analysis/BlockFrequencyInfo/redundant_edges.ll
12

This is tough question. Let's take an extreme case when loop has infinite number of exits. In this case probability to take back branch should go to zero and loop frequency should be 1.0. That means more exits we have less probable to got to the next iteration.

Existing algorithm completely ignores number of exits and always assumes loop makes 32 iteration. New algorithm assumes that back branch weight is 32 times higher than default exiting weight. Thus in case of one loop exit both algorithms give the following probability of the back branch:

edge loop -> loop probability is 0x7c000000 / 0x80000000 = 96.88% [HOT edge]

But because in this case there are two exiting edges probability of the back branch is calculated as '32*DW/(32*DW+2*DW)==32/34=0.94"

edge loop -> loop probability is 0x783e0f84 / 0x80000000 = 93.94%

I think this is pretty reasonable probability much better matching general high level picture.

There is one cavity though. Please note in this example we have two exiting branches from latch. If we had side exit from another block even new algorithm won't take it into account. I think we should put a TODO on that and follow up later. What do you think?

ebrevnov added inline comments.Aug 11 2020, 11:46 PM
llvm/lib/Analysis/BranchProbabilityInfo.cpp
719

sure

In fact, there is a documentation in header file. In source file I can put more details on behavior in case of rewrite attempt as you requested bellow.

ebrevnov updated this revision to Diff 284984.Aug 12 2020, 12:04 AM

Added a comment to updateEstimatedBlockWeight

wenlei added a subscriber: hoyFB.Aug 12 2020, 9:58 PM
wenlei added a subscriber: wenlei.

Is this work based on any paper/implementation?
Can we add some documentation at the top of the file to get an overall idea of the cost model?

llvm/lib/Analysis/BranchProbabilityInfo.cpp
717

nit: inconsistent comments '' vs '/'

774

do we have a tab here?

This comment was removed by xbolva00.

Is this work based on any paper/implementation?

I don't propose anything completely new here. New algorithm does essentially the same as existing one but

  1. fixes several correctness issues
  2. extends the approach to handle loops and invokes in a universal way

Can we add some documentation at the top of the file to get an overall idea of the cost model?

I don't think "cost model" is applicable here....anyway I can try to describe how thing work using some example. Is this what you are looking for?

llvm/lib/Analysis/BranchProbabilityInfo.cpp
717

Will remove extra /

774

No, I don't have tabs in the code. Looks like phabricator represents indention increase by 4 this way.

..anyway I can try to describe how thing work using some example. Is this what you are looking for?

yes. Thank you.

please run clang-format.

ebrevnov updated this revision to Diff 286234.Aug 18 2020, 3:03 AM

Added description of the algorithm to the header + formatting.

please run clang-format.

Done

..anyway I can try to describe how thing work using some example. Is this what you are looking for?

yes. Thank you.

Done

Is the test failure related to this patch?

Is the test failure related to this patch?

I doubt it is. But I will double check before commit.

ebrevnov edited the summary of this revision. (Show Details)Sep 16 2020, 10:19 PM

I'm still reading. The patch is long but I'd like to avoid splitting it into pieces that might result in massive tests changes (back and forth). Now the test changes seem to be compact.
It would be great if someone could join the reviewing efforts.

yrouban added inline comments.Oct 12 2020, 4:05 AM
llvm/include/llvm/Analysis/BranchProbabilityInfo.h
370

agree, the simple inplace definition would simplify the reading.

376

Returns

388

... then sets ... and returns true.

390

change: ... all blocks/loops potentially affected by ...
to: ... all blocks/loops that might need their weight re-estimated after ...
or something similar in meaning.

396

... propagates ...

407

... sets ...

llvm/lib/Analysis/BranchProbabilityInfo.cpp
107

may be rename to LOWEST_NON_ZERO_WEIGHT or ULP_WEIGHT to explicitly denote its meaning?

110

static_assert(UNREACHABLE_WEIGHT< VERY_LOW_WEIGHT)?

120
  1. group the new related *_WEIGHT into one enum (to distinguish them from the other *_WEIGH constants)
  2. explicitly describe their values (e.g. why COLD_WEIGHT is X times as high as LOWEST_NONZERO_WEIGHT)
  3. avoid using hex notation unless it makes sense (e.g. important for bitwise operations)

Something like the following:

enum {
  ZERO_WEIGHT = 0,
  UNREACHABLE_WEIGHT = ZERO_WEIGHT,
  LOWEST_NONZERO_WEIGHT = 1,
  NORETURN_WEIGHT = LOWEST_NONZERO_WEIGHT,
  UNWIND_WEIGHT = LOWEST_NONZERO_WEIGHT,
  COLD_WEIGHT = 65535 * LOWEST_NONZERO_WEIGHT,
  DEFAULT_WEIGHT = 64 * COLD_WEIGHT
};
125

I suggest that we introduce the UNKNOWN_WEIGHT for most of the DEFAULT_WEIGHT uses to better reflect its meaning and the way it is treated.
DEFAULT_WEIGHT should be used only for UNKNOWN_WEIGHT at the very last step of BPI calculation (after propagation).

657

could be defined along with its declaration for ease of reading

659

could be defined along with its declaration for ease of reading

665

could be defined along with its declaration for ease of reading

717

propagate

718

could be renamed to estimateBlockWeights() (plural)

719

I would remove this doc from the definition (to not deviate from the doc at the declaration) and put the note inside the body.

721

early return would be easier to read and would not need std::tie and std::ignore:

if (!EstimatedBlockWeight.insert({BB, BBWeight}).second)
  return false;
...
return true;
724

'is never returns' sounds strange. Not sure but may be IsNeverReturn? IsNeverReturning? IsDeadEnd?
The function is used only once and only with blocks of two kinds:

  1. blocks that have their terminator instruction preceded with a deoptimize call. So the terminator must be the ret instruction.
  2. blocks with unreachable terminator instruction.

Both cases imply no block successors. So it is better to have
assert(BB->getTerminator()->getNumSuccessors() == 0)
Otherwise (for a generic function) there could be a question: why this condition have a lower priority than a call with Attribute::NoReturn?

call @foo() noreturn
br label %next
724–725

the idiom:

for (BasicBlock *P : predecessors(BB))
727

please explain (what if previous weight is not equal to the new, may be assert(PrevWeight <= BBWeight)?)

741

change TC to trip count

751

rename WorkList to BlockWorkList as it is in computeEstimatedBlockWeight().

759

if we extracted a lambda estimateBlockWeight (which is worth to be a separate member function with its description) then the structure would be concise:

auto estimateBlockWight = [&](const BasicBlock *BB) -> Optional<uint32_t>{
  if (isa<UnreachableInst>(BB->getTerminator()) || BB->getTerminatingDeoptimizeCall())
    return IsNeverReturns(BB) ? NORETURN_WEIGHT : UNREACHABLE_WEIGHT;

  for (const auto *Pred : predecessors(BB))
    if (const auto *II = dyn_cast<InvokeInst>(Pred->getTerminator()))
      if (II->getUnwindDest() == BB)
        return UNWIND_WEIGHT;

  for (const auto &I : *BB)
    if (const CallInst *CI = dyn_cast<CallInst>(&I))
      if (CI->hasFnAttr(Attribute::Cold))
        return COLD_WEIGHT;

  return None;
};

ReversePostOrderTraversal<const Function *> RPOT(&F);
for (const auto *BB : RPOT)
  if (auto BBWeight = estimateBlockWight(BB))
    propagateEstimatedBlockWeight(getLoopBlock(BB), DT, PDT,
                                  BBWeight.getValue(), BlockWorkList,
                                  LoopWorkList);
770

could it result in a deep recursion? may be a worklist would be better?

771–773
BBWeight = IsNeverReturns(BB) ? NORETURN_WEIGHT : UNREACHABLE_WEIGHT;
772

predecessors

788

not needed?

792

remove braces { .. }

Please fix clang-format issues.

yrouban added inline comments.Dec 11 2020, 3:12 AM
llvm/include/llvm/Analysis/BranchProbabilityInfo.h
380

There could be no edge from SrcBB to Successor, but there must be a loop edge. Please rephrase a bit.
... edges leading from loop of SrcBB to loops of Successors' block ...

llvm/lib/Analysis/BranchProbabilityInfo.cpp
674

I would start with the VERY_LOW_WEIGHT. That is if there is no loop exits then the loop is infinite, that is it can be entered at most once.

757

It seems we have to run a similar loop for the PostDom tree.

762–764

I'm not sure if this place is right but I suggest that we have a long description of the Estimated Execution Weight. As its definition starts from the basic blocks weight, I think this place is ok.
.........................................
Lets introduce a notion of Estimated Execution Weight (or just weight). It is defined in several steps.
It will be used to calculate branch probabilities according to the following rule:
if all block successors have their Estimated Execution Weights defined then the branch probabilities can be calculated as if the weights are just !prof branch_weights metadata:
BranchProbability[i] = BranchWeight [i]/( BranchWeight[0] + BranchWeight[1] + ... + BranchWeight[N-1])

Note that the Estimated Execution Weights are not mixed with the profile counters or with the branch_weights metedata. So, it is important for the rule to have all successors’ Estimated Execution Weights defined. If at least one branch undefined then the rule is not applicable.

Definition of Estimated Execution Weight for basic blocks is based on their properties:
Weight is defined for 4 kinds of basic blocks:

  • Unreachable (weight=0) - if the block terminator is the unreachable instruction;
  • Noreturn (weight=1) - if the block has a noreturn call;
  • Unwind (weight=1) - if the block is an unwind branch target (i.e. a landingpad block);
  • Cold (weight=65535) - if the block has a cold call;

These properties are listed in their priority to define the block weight. E.g. if a block is Cold and Noreturn than the Noreturn property defines its weight=1.

Example:
Weights {Cold, Unwind, Unreachable} =>
Probabilities {Cold / (Cold + Unwind + Unreachable), Unwind / (Cold + Unwind + Unreachable), Unreachable /( Cold + Unwind + Unreachable)} =
{65535/ (65535 + 1 + 0), 1/ (65535 + 1 + 0), 0/(65535 + 1 + 0)}

It is worth mention that this block weight does not depend on number of block’s predecessors. In other words, these properties (unreachable, noreturn, unwind, cold) do not become any weaker or stronger if the block has more predecessors. I.e. any edge that ends with such block has its weight defined according to one of the properties of this block.
Example:

       BB1       BB2
      /   \     /   \
ColdBB3   ColdBB4   UnwindBB5

The block ColdBB4 has the weight Cold as the block ColdBB3.
Probability(BB1->ColdBB3) = Probability(BB1->ColdBB4) = 50%
Probability(BB2->ColdBB4) = Cold / (Cold + Unwind)
Probability(BB2->UnwindBB5) = Unwind / (Cold + Unwind)

.........................................
... to be continued for loop weights and edge weights

769
  1. if Edge is a loop exiting edge then we can set weight of the loop of DomBB as it lays on the same dom-postdom line with BB. Then we do not need to add the loop to the worklist but have to add its entering blocks to BlockWorkList.
  2. Avoid checking for loop exiting condition twice:
if (isLoopExitingEdge(Edge)) // Check before isLoopEntering() as it might be also true.
  LoopWorkList.push_back(DomLoopBB);
if (isLoopEnteringEdge(Edge))
  ; // Nothing todo. Why?
else if (!updateEstimatedBlockWeight(...))
  break;

Please comment that we do not update block weights for blocks that are not in the same loop with BB. Instead we update their loop weights.

787

for the unwind property we do not need to iterate over all predecessors.
if block has one unwind predecessors then all its predecessors must be unwind (that is because the block must start with a landing pad). I would suggest to commit this and check only one of the predecessors.

805

nit: remove { }

816

.. BlockWorkList ...
LoopWorkList has a priority over BlockWorklist. Why?

819

'while' would be better and the last 'continue' would not be needed.

822

why should we avoid calculating the loop weight? what if we find another weight for this loop?

828

I would rename it to LoopWeight which is calculated as max of weights of loop exits but at least VERY_LOW_WEIGHT. This is the definition of the loop weight. Why/how is it relevant to block weights?

ebrevnov updated this revision to Diff 313274.Dec 22 2020, 3:05 AM
ebrevnov marked 17 inline comments as done.

Addressed comments

ebrevnov added inline comments.Dec 22 2020, 3:07 AM
llvm/lib/Analysis/BranchProbabilityInfo.cpp
110

Why? Currently, this assert holds but there is nothing preventing us from increasing UNREACHABLE_WEIGHT if needed by the cost model.

120
  1. Done
  2. This is just a heuristic values. You can't have precise description for them.
  3. Done.
125

Which uses are you talking about. I see three uses in calcEstimatedHeuristics only

674

This method does nothing special to loop exits. It just iterates over all edges and returns maximum weight. If there are no edges it signals None.

719

Documentation at declaration and definition provides different level of details. They repeat each other a bit but differs in many ways.

727

We can't assert since single block might have several "incompatible" weights. For example the "unwind" block may have "cold" call. In that case we favor the first one encountered and rely on proper evaluation order.

757

Didn't get.

769

I framed the code this way to explicitly emphasize loop and non-loop cases. The compiler is clever enough to do the suggested optimization behind the scence.

770

Recursion won't happen as first thing updateEstimatedWeight does is checking if we have already processed this block.

787

Then it will always stop on the first predecessor, right? May unwind have more than one predecessor at all?

788

Not sure what was the case, but I came across the case when Pred was null.

816

There should be no difference which one processed first. Someone has to be first :-)

822

In general, weight is assigned to a block/loop when it has final value and can't/shouldn't be changed. However, there are cases when a block/loop inherently has several (possibly "contradicting") weights. For example, "unwind" block may also contain "cold" call. In that case the first set weight is favored and all consequent weights are ignored.

yrouban accepted this revision.Dec 23 2020, 4:27 AM

LGTM if my latest comments are addressed.
Let's give this big work a try.

llvm/include/llvm/Analysis/BranchProbabilityInfo.h
65

override

70

noreturn blocks?

71

each such sounds strange. let's rephrase:
Those blocks get their weights set to BlockExecWeight::UNREACHABLE, BlockExecWeight::NORETURN, BlockExecWeight::UNWIND and BlockExecWeight::COLD respectively. Then the weights are propagated to the other blocks up the domination line.

75

We should say about default weight in context of branch probability calculation.
... the process repeats. Once the process of weights propagation converges we set branch probabilities. It is done for all blocks with at lest one successor with its weight set. For successors without weights we use the default execution weight (BlockExecWeight::DEFAULT). For loop back branches we use their weights scaled by ...

84

may be rename BR1 to BB1 and BR2 to BB2? 'R' seems redundant.

llvm/lib/Analysis/BranchProbabilityInfo.cpp
110

The word unreachable means 'cannot be reached' and thus cannot be executed. How this execution weight be anything except zero? Any non-zero weight for unreachable successors would result in the sum of probabilities of all reachable successors be less than 1.0. I do not insist though.

139–140

I would write that the default weight is used to calculate branch probability of edges which destination blocks does not have their estimated execution weight. The default weight is not set as estimated execution weight and thus is not propagated through domination line.

347

why do we need this static_cast? they are so many in the code. can we get rid of them?

690

Please add a TODO: In addition to this propagation up the domination line consider propagating down the domination line.

720

does it make sense to reverse iterate? noreturn calls are likely close to block ends

779

please add comment:

// Process loops and blocks. Order is not important.
787

I believe that unwind blocks can have more than one predecessor as one catch block for several callsites.

827

... require

856–860

shorter

Weight = std::max(BlockExecWeight::LOWEST_NON_ZERO, Weight.getValueOr(static_cast<uint32_t>(BlockExecWeight::DEFAULT) / TC);
867

Weight = std::max(BlockExecWeight::LOWEST_NON_ZERO, ...);

881–902

none

882

If TotalWeight is zero then weight of ...

llvm/test/Analysis/BranchProbabilityInfo/basic.ll
156–158

space between ; and CHECK

275–276

space

llvm/test/Analysis/BranchProbabilityInfo/deopt-invoke.ll
44

space

llvm/test/Analysis/BranchProbabilityInfo/loop.ll
528

space

552

space please

This revision is now accepted and ready to land.Dec 23 2020, 4:27 AM
ebrevnov updated this revision to Diff 313548.Dec 23 2020, 6:56 AM
ebrevnov marked 13 inline comments as done.

Addressed comments

This revision was landed with ongoing or failed builds.Dec 23 2020, 7:47 AM
This revision was automatically updated to reflect the committed changes.