Page MenuHomePhabricator

[BPI] Improve static heuristics for "cold" paths.
ClosedPublic

Authored by ebrevnov on May 6 2020, 5:36 AM.

Details

Summary

Current approach doesn't work well in cases when multiple paths are predicted to be "cold". By "cold" paths I mean those containing "unreachable" instruction, call marked with 'cold' attribute and 'unwind' handler of 'invoke' instruction. The issue is that heuristics are applied one by one until the first match and essentially ignores relative hotness/coldness
of other paths.

New approach unifies processing of "cold" paths by assigning predefined absolute weight to each block estimated to be "cold". Then we propagate these weights up/down IR similarly to existing approach. And finally set up edge probabilities based on estimated block weights.

One important difference is how we propagate weight up. Existing approach propagates the same weight to all blocks that are post-dominated by a block with some "known" weight. This is useless at least because it always gives 50\50 distribution which is assumed by default anyway. Worse, it causes the algorithm to skip further heuristics and can miss setting more accurate probability. New algorithm propagates the weight up only to the blocks that dominates and post-dominated by a block with some "known" weight. In other words, those blocks that are either always executed or not executed together.

In addition new approach processes loops in an uniform way as well. Essentially loop exit edges are estimated as "cold" paths relative to back edges and should be considered uniformly with other coldness/hotness markers.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
yrouban added inline comments.Oct 12 2020, 4:05 AM
llvm/include/llvm/Analysis/BranchProbabilityInfo.h
370

agree, the simple inplace definition would simplify the reading.

376

Returns

388

... then sets ... and returns true.

390

change: ... all blocks/loops potentially affected by ...
to: ... all blocks/loops that might need their weight re-estimated after ...
or something similar in meaning.

396

... propagates ...

407

... sets ...

llvm/lib/Analysis/BranchProbabilityInfo.cpp
107

may be rename to LOWEST_NON_ZERO_WEIGHT or ULP_WEIGHT to explicitly denote its meaning?

110

static_assert(UNREACHABLE_WEIGHT< VERY_LOW_WEIGHT)?

120
  1. group the new related *_WEIGHT into one enum (to distinguish them from the other *_WEIGH constants)
  2. explicitly describe their values (e.g. why COLD_WEIGHT is X times as high as LOWEST_NONZERO_WEIGHT)
  3. avoid using hex notation unless it makes sense (e.g. important for bitwise operations)

Something like the following:

enum {
  ZERO_WEIGHT = 0,
  UNREACHABLE_WEIGHT = ZERO_WEIGHT,
  LOWEST_NONZERO_WEIGHT = 1,
  NORETURN_WEIGHT = LOWEST_NONZERO_WEIGHT,
  UNWIND_WEIGHT = LOWEST_NONZERO_WEIGHT,
  COLD_WEIGHT = 65535 * LOWEST_NONZERO_WEIGHT,
  DEFAULT_WEIGHT = 64 * COLD_WEIGHT
};
125

I suggest that we introduce the UNKNOWN_WEIGHT for most of the DEFAULT_WEIGHT uses to better reflect its meaning and the way it is treated.
DEFAULT_WEIGHT should be used only for UNKNOWN_WEIGHT at the very last step of BPI calculation (after propagation).

657

could be defined along with its declaration for ease of reading

659

could be defined along with its declaration for ease of reading

665

could be defined along with its declaration for ease of reading

717

propagate

718

could be renamed to estimateBlockWeights() (plural)

719

I would remove this doc from the definition (to not deviate from the doc at the declaration) and put the note inside the body.

721

early return would be easier to read and would not need std::tie and std::ignore:

if (!EstimatedBlockWeight.insert({BB, BBWeight}).second)
  return false;
...
return true;
724

'is never returns' sounds strange. Not sure but may be IsNeverReturn? IsNeverReturning? IsDeadEnd?
The function is used only once and only with blocks of two kinds:

  1. blocks that have their terminator instruction preceded with a deoptimize call. So the terminator must be the ret instruction.
  2. blocks with unreachable terminator instruction.

Both cases imply no block successors. So it is better to have
assert(BB->getTerminator()->getNumSuccessors() == 0)
Otherwise (for a generic function) there could be a question: why this condition have a lower priority than a call with Attribute::NoReturn?

call @foo() noreturn
br label %next
724–725

the idiom:

for (BasicBlock *P : predecessors(BB))
727

please explain (what if previous weight is not equal to the new, may be assert(PrevWeight <= BBWeight)?)

741

change TC to trip count

751

rename WorkList to BlockWorkList as it is in computeEstimatedBlockWeight().

759

if we extracted a lambda estimateBlockWeight (which is worth to be a separate member function with its description) then the structure would be concise:

auto estimateBlockWight = [&](const BasicBlock *BB) -> Optional<uint32_t>{
  if (isa<UnreachableInst>(BB->getTerminator()) || BB->getTerminatingDeoptimizeCall())
    return IsNeverReturns(BB) ? NORETURN_WEIGHT : UNREACHABLE_WEIGHT;

  for (const auto *Pred : predecessors(BB))
    if (const auto *II = dyn_cast<InvokeInst>(Pred->getTerminator()))
      if (II->getUnwindDest() == BB)
        return UNWIND_WEIGHT;

  for (const auto &I : *BB)
    if (const CallInst *CI = dyn_cast<CallInst>(&I))
      if (CI->hasFnAttr(Attribute::Cold))
        return COLD_WEIGHT;

  return None;
};

ReversePostOrderTraversal<const Function *> RPOT(&F);
for (const auto *BB : RPOT)
  if (auto BBWeight = estimateBlockWight(BB))
    propagateEstimatedBlockWeight(getLoopBlock(BB), DT, PDT,
                                  BBWeight.getValue(), BlockWorkList,
                                  LoopWorkList);
770

could it result in a deep recursion? may be a worklist would be better?

771–773
BBWeight = IsNeverReturns(BB) ? NORETURN_WEIGHT : UNREACHABLE_WEIGHT;
772

predecessors

788

not needed?

792

remove braces { .. }

Please fix clang-format issues.

yrouban added inline comments.Dec 11 2020, 3:12 AM
llvm/include/llvm/Analysis/BranchProbabilityInfo.h
380

There could be no edge from SrcBB to Successor, but there must be a loop edge. Please rephrase a bit.
... edges leading from loop of SrcBB to loops of Successors' block ...

llvm/lib/Analysis/BranchProbabilityInfo.cpp
674

I would start with the VERY_LOW_WEIGHT. That is if there is no loop exits then the loop is infinite, that is it can be entered at most once.

757

It seems we have to run a similar loop for the PostDom tree.

762–764

I'm not sure if this place is right but I suggest that we have a long description of the Estimated Execution Weight. As its definition starts from the basic blocks weight, I think this place is ok.
.........................................
Lets introduce a notion of Estimated Execution Weight (or just weight). It is defined in several steps.
It will be used to calculate branch probabilities according to the following rule:
if all block successors have their Estimated Execution Weights defined then the branch probabilities can be calculated as if the weights are just !prof branch_weights metadata:
BranchProbability[i] = BranchWeight [i]/( BranchWeight[0] + BranchWeight[1] + ... + BranchWeight[N-1])

Note that the Estimated Execution Weights are not mixed with the profile counters or with the branch_weights metedata. So, it is important for the rule to have all successors’ Estimated Execution Weights defined. If at least one branch undefined then the rule is not applicable.

Definition of Estimated Execution Weight for basic blocks is based on their properties:
Weight is defined for 4 kinds of basic blocks:

  • Unreachable (weight=0) - if the block terminator is the unreachable instruction;
  • Noreturn (weight=1) - if the block has a noreturn call;
  • Unwind (weight=1) - if the block is an unwind branch target (i.e. a landingpad block);
  • Cold (weight=65535) - if the block has a cold call;

These properties are listed in their priority to define the block weight. E.g. if a block is Cold and Noreturn than the Noreturn property defines its weight=1.

Example:
Weights {Cold, Unwind, Unreachable} =>
Probabilities {Cold / (Cold + Unwind + Unreachable), Unwind / (Cold + Unwind + Unreachable), Unreachable /( Cold + Unwind + Unreachable)} =
{65535/ (65535 + 1 + 0), 1/ (65535 + 1 + 0), 0/(65535 + 1 + 0)}

It is worth mention that this block weight does not depend on number of block’s predecessors. In other words, these properties (unreachable, noreturn, unwind, cold) do not become any weaker or stronger if the block has more predecessors. I.e. any edge that ends with such block has its weight defined according to one of the properties of this block.
Example:

       BB1       BB2
      /   \     /   \
ColdBB3   ColdBB4   UnwindBB5

The block ColdBB4 has the weight Cold as the block ColdBB3.
Probability(BB1->ColdBB3) = Probability(BB1->ColdBB4) = 50%
Probability(BB2->ColdBB4) = Cold / (Cold + Unwind)
Probability(BB2->UnwindBB5) = Unwind / (Cold + Unwind)

.........................................
... to be continued for loop weights and edge weights

769
  1. if Edge is a loop exiting edge then we can set weight of the loop of DomBB as it lays on the same dom-postdom line with BB. Then we do not need to add the loop to the worklist but have to add its entering blocks to BlockWorkList.
  2. Avoid checking for loop exiting condition twice:
if (isLoopExitingEdge(Edge)) // Check before isLoopEntering() as it might be also true.
  LoopWorkList.push_back(DomLoopBB);
if (isLoopEnteringEdge(Edge))
  ; // Nothing todo. Why?
else if (!updateEstimatedBlockWeight(...))
  break;

Please comment that we do not update block weights for blocks that are not in the same loop with BB. Instead we update their loop weights.

787

for the unwind property we do not need to iterate over all predecessors.
if block has one unwind predecessors then all its predecessors must be unwind (that is because the block must start with a landing pad). I would suggest to commit this and check only one of the predecessors.

805

nit: remove { }

816

.. BlockWorkList ...
LoopWorkList has a priority over BlockWorklist. Why?

819

'while' would be better and the last 'continue' would not be needed.

822

why should we avoid calculating the loop weight? what if we find another weight for this loop?

828

I would rename it to LoopWeight which is calculated as max of weights of loop exits but at least VERY_LOW_WEIGHT. This is the definition of the loop weight. Why/how is it relevant to block weights?

ebrevnov updated this revision to Diff 313274.Dec 22 2020, 3:05 AM
ebrevnov marked 17 inline comments as done.

Addressed comments

ebrevnov added inline comments.Dec 22 2020, 3:07 AM
llvm/lib/Analysis/BranchProbabilityInfo.cpp
110

Why? Currently, this assert holds but there is nothing preventing us from increasing UNREACHABLE_WEIGHT if needed by the cost model.

120
  1. Done
  2. This is just a heuristic values. You can't have precise description for them.
  3. Done.
125

Which uses are you talking about. I see three uses in calcEstimatedHeuristics only

674

This method does nothing special to loop exits. It just iterates over all edges and returns maximum weight. If there are no edges it signals None.

719

Documentation at declaration and definition provides different level of details. They repeat each other a bit but differs in many ways.

727

We can't assert since single block might have several "incompatible" weights. For example the "unwind" block may have "cold" call. In that case we favor the first one encountered and rely on proper evaluation order.

757

Didn't get.

769

I framed the code this way to explicitly emphasize loop and non-loop cases. The compiler is clever enough to do the suggested optimization behind the scence.

770

Recursion won't happen as first thing updateEstimatedWeight does is checking if we have already processed this block.

787

Then it will always stop on the first predecessor, right? May unwind have more than one predecessor at all?

788

Not sure what was the case, but I came across the case when Pred was null.

816

There should be no difference which one processed first. Someone has to be first :-)

822

In general, weight is assigned to a block/loop when it has final value and can't/shouldn't be changed. However, there are cases when a block/loop inherently has several (possibly "contradicting") weights. For example, "unwind" block may also contain "cold" call. In that case the first set weight is favored and all consequent weights are ignored.

yrouban accepted this revision.Dec 23 2020, 4:27 AM

LGTM if my latest comments are addressed.
Let's give this big work a try.

llvm/include/llvm/Analysis/BranchProbabilityInfo.h
65

override

70

noreturn blocks?

71

each such sounds strange. let's rephrase:
Those blocks get their weights set to BlockExecWeight::UNREACHABLE, BlockExecWeight::NORETURN, BlockExecWeight::UNWIND and BlockExecWeight::COLD respectively. Then the weights are propagated to the other blocks up the domination line.

75

We should say about default weight in context of branch probability calculation.
... the process repeats. Once the process of weights propagation converges we set branch probabilities. It is done for all blocks with at lest one successor with its weight set. For successors without weights we use the default execution weight (BlockExecWeight::DEFAULT). For loop back branches we use their weights scaled by ...

84

may be rename BR1 to BB1 and BR2 to BB2? 'R' seems redundant.

llvm/lib/Analysis/BranchProbabilityInfo.cpp
110

The word unreachable means 'cannot be reached' and thus cannot be executed. How this execution weight be anything except zero? Any non-zero weight for unreachable successors would result in the sum of probabilities of all reachable successors be less than 1.0. I do not insist though.

139–140

I would write that the default weight is used to calculate branch probability of edges which destination blocks does not have their estimated execution weight. The default weight is not set as estimated execution weight and thus is not propagated through domination line.

347

why do we need this static_cast? they are so many in the code. can we get rid of them?

690

Please add a TODO: In addition to this propagation up the domination line consider propagating down the domination line.

720

does it make sense to reverse iterate? noreturn calls are likely close to block ends

779

please add comment:

// Process loops and blocks. Order is not important.
787

I believe that unwind blocks can have more than one predecessor as one catch block for several callsites.

827

... require

856–860

shorter

Weight = std::max(BlockExecWeight::LOWEST_NON_ZERO, Weight.getValueOr(static_cast<uint32_t>(BlockExecWeight::DEFAULT) / TC);
867

Weight = std::max(BlockExecWeight::LOWEST_NON_ZERO, ...);

881–902

none

882

If TotalWeight is zero then weight of ...

llvm/test/Analysis/BranchProbabilityInfo/basic.ll
156–158

space between ; and CHECK

275–276

space

llvm/test/Analysis/BranchProbabilityInfo/deopt-invoke.ll
44

space

llvm/test/Analysis/BranchProbabilityInfo/loop.ll
528

space

552

space please

This revision is now accepted and ready to land.Dec 23 2020, 4:27 AM
ebrevnov updated this revision to Diff 313548.Dec 23 2020, 6:56 AM
ebrevnov marked 13 inline comments as done.

Addressed comments

This revision was landed with ongoing or failed builds.Dec 23 2020, 7:47 AM
This revision was automatically updated to reflect the committed changes.