This is an archive of the discontinued LLVM Phabricator instance.

A post-processing for BFI inference
Closed, Public

Authored by spupyrev on May 27 2021, 3:36 PM.

Details

Summary

The current implementation for computing relative block frequencies does
not correctly handle control-flow graphs containing irreducible loops. This
results in suboptimally generated binaries whose performance can be up to 5%
worse than optimal.

To resolve the problem, we apply a post-processing step, which iteratively
updates block frequencies based on the frequencies of their predecessors.
This corresponds to finding the stationary point of the Markov chain by
an iterative method, aka "PageRank computation". The algorithm takes at
most O(|E| * IterativeBFIMaxIterations) steps but typically converges faster.
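
In other words (a schematic restatement, not the notation used in the implementation), the algorithm searches for a fixed point

$$ f = f \cdot P, $$

where $f$ is the row vector of block frequencies and $P$ is the matrix of branch probabilities, by repeatedly replacing $f$ with $f \cdot P$ until the largest per-block change falls below a tolerance or the iteration limit is hit; each iteration costs O(|E|).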

It is turned on by passing the option use-iterative-bfi-inference
and is applied only to functions containing profile data and irreducible loops.

Tested on SPEC06/17, where it helps to recover correct profile counts for one of
the binaries (403.gcc). In production binaries, we've seen speedups of 2%-5%
for binaries containing functions with hot irreducible loops.

Diff Detail

Unit Tests: Failed

Event Timeline

spupyrev created this revision. May 27 2021, 3:36 PM
spupyrev requested review of this revision. May 27 2021, 3:36 PM
Herald added a project: Restricted Project. May 27 2021, 3:36 PM
spupyrev edited the summary of this revision. May 27 2021, 3:41 PM
spupyrev added a reviewer: wenlei.
spupyrev edited the summary of this revision. May 27 2021, 3:44 PM
wlei added a subscriber: wlei. May 27 2021, 3:51 PM
spupyrev edited the summary of this revision. May 27 2021, 3:52 PM

thanks for working on this issue. A high level question -- is it possible to do the fix up on a per (irreducible) loop basis?

llvm/test/Transforms/SampleProfile/profile-correlation-irreducible-loops.ll
2

why -enable-new-pm = 0?

11

It will be helpful to draw a simple text art CFG to demonstrate the expected bb counts.

spupyrev updated this revision to Diff 349009. Jun 1 2021, 10:15 AM

Adding ASCII representation of the test CFGs

spupyrev marked an inline comment as done. Jun 1 2021, 10:30 AM

thanks for working on this issue. A high level question -- is it possible to do the fix up on a per (irreducible) loop basis?

Would you mind expanding on why you'd prefer a per-loop solution?

In general, we found that processing the entire control-flow graph (as opposed to identifying some "problematic" subgraphs first) is much easier from the implementation point of view, while still keeping the algorithm fairly efficient. We have a notion of "active" blocks that are being updated, and the algorithm processes only such active vertices. Thus, if the input counts are incorrect in a single loop, the algorithm will quickly learn that and will not touch the rest of the graph.

llvm/test/Transforms/SampleProfile/profile-correlation-irreducible-loops.ll
2

Without the option, I get

Cannot specify -analyze under new pass manager, either specify '-enable-new-pm=0', or use the corresponding new pass manager pass, e.g. '-passes=print<scalar-evolution>'. For a full list of passes, see the '--print-passes' flag.

thanks for working on this issue. A high level question -- is it possible to do the fix up on a per (irreducible) loop basis?

Would you mind expanding on why you'd prefer a per-loop solution?

Mainly to reduce compile time overhead, but you have explained that it is not an issue.

In general, we found that processing the entire control-flow graph (as opposed to identifying some "problematic" subgraphs first) is much easier from the implementation point of view, while still keeping the algorithm fairly efficient. We have a notion of "active" blocks that are being updated, and the algorithm processes only such active vertices. Thus, if the input counts are incorrect in a single loop, the algorithm will quickly learn that and will not touch the rest of the graph.

llvm/include/llvm/Analysis/BlockFrequencyInfoImpl.h
1388

why is this map needed (which adds a layer of indirection)?

1397

is it possible, given the blocks are hot?

spupyrev added inline comments. Jun 2 2021, 10:32 AM
llvm/include/llvm/Analysis/BlockFrequencyInfoImpl.h
1388

The map is used to index successors/predecessors of "hot" blocks, see line 1603.

As an optimization, we don't process all the blocks in a function but only those that can be reached from the entry via branches with a positive probability. These are HotBlocks in the code. Typically, the number of HotBlocks is 2x-5x smaller than the total number of blocks in the function. In order to find an index of a block within the list, we either need to do a linear scan over HotBlocks, or have such an extra map.
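
For illustration, a minimal sketch of such an index map (the names BlockIndex, HotBlocks, and ReachableFromEntry are placeholders, not necessarily those used in the patch; BlockT is the block type of the templated BFI implementation):

// Give every hot block a dense index so that successor/predecessor lists
// can store positions into HotBlocks instead of pointers.
llvm::DenseMap<const BlockT *, size_t> BlockIndex;
std::vector<const BlockT *> HotBlocks;
for (const BlockT *BB : ReachableFromEntry) { // blocks reachable via positive-probability branches
  BlockIndex[BB] = HotBlocks.size();
  HotBlocks.push_back(BB);
}
// A jump Src -> Dst between hot blocks can then be stored as the index pair
// (BlockIndex[Src], BlockIndex[Dst]) and looked up in O(1).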

1397

In theory, there is no guarantee that at least one of the getFloatingBlockFreq values is non-zero. (Notice that our "definition" of hot blocks does not rely on the result of that method.)

In practice, I've never seen this condition satisfied in our extensive evaluation. So let me change it to an assertion.

spupyrev updated this revision to Diff 349340. Jun 2 2021, 11:51 AM

Adding a few comments and getting rid of an unnecessary check

davidxl added inline comments. Jun 3 2021, 9:43 AM
llvm/include/llvm/Analysis/BlockFrequencyInfoImpl.h
1439

can this overflow?

1440

why not using ScaledNumber::get(uint64_t) interface?

1443

why multiplying by Freq.size()? Should the option description reflect this?

1451

Can this loop be moved into the computation of probMatrix, passing the succ vector in, to avoid redundant computation?

1489

Does it apply to other backedges too?

spupyrev marked an inline comment as done. Jun 3 2021, 5:58 PM
spupyrev added inline comments.
llvm/include/llvm/Analysis/BlockFrequencyInfoImpl.h
1440

Here we convert double EPS = 1e-12 to Scaled64, so we need some magic: ScaledNumber::get(uint64_t) won't work for values < 1.
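
For example (only a sketch; the exact construction in the patch may differ), a sub-unit constant can be approximated by dividing two integer-valued ScaledNumbers:

using Scaled64 = ScaledNumber<uint64_t>;
// ScaledNumber::get() accepts only integers, so approximate 1e-12 as a quotient.
Scaled64 Eps = Scaled64::get(1);
Eps /= Scaled64::get(1000000000000ULL); // ~1e-12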

1443

good point! I renamed the option and adjusted the description

1451

Here Successors represents the successors of each vertex in the (auxiliary) graph. It is different from the Succs object in the original CFG. (In particular, the auxiliary graph contains jumps from all exit blocks to the entry.)

Also, I find the current interface a bit cleaner: the main inference method, iterativeInference, takes the probability matrix as input and returns the computed frequencies; Successors is an internal variable needed for the computation.

1489

not sure I fully understand the question, but we need an adjustment only for self-edges; blocks without self-edges don't need any post-processing

I added a short comment before the loop

davidxl added inline comments. Jun 4 2021, 1:24 PM
llvm/include/llvm/Analysis/BlockFrequencyInfoImpl.h
1489

NewFreq /= OneMinusSelfProb looks like multiplying the block freq (of a one-iteration loop) by the average trip count -- that is why I asked if this applies to other backedges.

spupyrev marked an inline comment as done. Jun 4 2021, 1:46 PM
spupyrev added inline comments.
llvm/include/llvm/Analysis/BlockFrequencyInfoImpl.h
1489

Here is the relevant math:

We want to find a new frequency for block I, Freq[I], such that it is equal to \sum Freq[J] * Prob[J][I], where the sum is taken over all (incoming) jumps (J -> I). These are the "ideal" frequencies that BFI is trying to compute.

Clearly, if the I-th block has no self-edges, then we simply assign Freq[I] := \sum Freq[J] * Prob[J][I] (that is, no adjustment). However, if there are self-edges, we need to assign Freq[I] := (\sum_{J != I} Freq[J] * Prob[J][I]) / (1 - Prob[I][I]), which is the adjustment in the code.
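
A sketch of what that adjustment can look like in the update step (variable names are illustrative and may not match the patch exactly):

// Recompute Freq[I] from its incoming jumps; the self-edge is folded in
// analytically by dividing by (1 - Prob[I][I]).
Scaled64 NewFreq = Scaled64::getZero();
Scaled64 OneMinusSelfProb = Scaled64::getOne();
for (const auto &Jump : ProbMatrix[I]) {        // pairs (J, Prob[J][I])
  if (Jump.first == I)
    OneMinusSelfProb -= Jump.second;            // accumulate self-edge probability
  else
    NewFreq += Freq[Jump.first] * Jump.second;  // contribution of predecessor J
}
if (OneMinusSelfProb != Scaled64::getOne())
  NewFreq /= OneMinusSelfProb;                  // the self-edge adjustment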

davidxl added inline comments. Jun 4 2021, 2:16 PM
llvm/include/llvm/Analysis/BlockFrequencyInfoImpl.h
1489

I wonder why the special treatment is needed in the first place.

Suppose we have

 BB1 (init freq = 50)
   |
   v
 BB2 (init freq = 0) <---+
   |       \ 90%         |
   | 10%    \____________|
   v

With iterative fixup, BB2's frequency will converge to 500, which is the right value without any special handling.
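
For reference, the expected value follows from the geometric series over repeated executions of the 90% back edge:

$$ \text{Freq}[\mathrm{BB2}] = 50 \cdot \sum_{k \ge 0} 0.9^k = \frac{50}{1 - 0.9} = 500. $$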

spupyrev added inline comments. Jun 4 2021, 3:33 PM
llvm/include/llvm/Analysis/BlockFrequencyInfoImpl.h
1489

Excellent example!

The correct inference here is Freq[BB1] = 50, Freq[BB2] = 500, which is found after 5 iterations using the diff. If we remove the self-edge adjustment, we don't get the right result: it converges to Freq[BB1] = 50, Freq[BB2] = 50 after ~100 iterations. (Observe that we do modify the frequency of the entry block; it is not fixed.)

In general, I do not have a proof that the Markov chain always converges to the desired stationary point, if we incorrectly update frequencies (e.g., w/o the self-edge adjustment) -- I suspect it does not.

davidxl added inline comments. Jun 4 2021, 4:06 PM
llvm/include/llvm/Analysis/BlockFrequencyInfoImpl.h
1489

By entry frequency, do you mean BB1's frequency? BB1 won't be active after the first iteration right?

spupyrev added inline comments. Jun 4 2021, 4:56 PM
llvm/include/llvm/Analysis/BlockFrequencyInfoImpl.h
1489

Yes I meant BB1's frequency.

Notice that in order to create a valid Markov chain, we need to add jumps from all exits to the entry; in this case, from BB2 to BB1. So BB1 will be active on later iterations.

davidxl added inline comments. Jun 5 2021, 11:34 AM
llvm/include/llvm/Analysis/BlockFrequencyInfoImpl.h
1489

Can you verify if it still works without the adjustment: in the small example, split BB2 into two BBs.

spupyrev added inline comments. Jun 5 2021, 12:11 PM
llvm/include/llvm/Analysis/BlockFrequencyInfoImpl.h
1489

I've commented above:

If we remove the self-edge adjustment, we don't get the right result: it converges to Freq[BB1] = 50, Freq[BB2] = 50 after ~100 iterations.
In general, I do not have a proof that the Markov chain always converges to the desired stationary point, if we incorrectly update frequencies (e.g., w/o the self-edge adjustment) -- I suspect it does not.

What is the concern/question here? In my mind, this is not a "fix/hack" but the correct way of applying iterative inference.

davidxl added inline comments. Jun 5 2021, 12:23 PM
llvm/include/llvm/Analysis/BlockFrequencyInfoImpl.h
1489

There are not many concerns, and the patch is almost good to go in. Just want to make sure the algorithm works for all cases.

Also thanks for the patience!

wenlei added inline comments. Jun 6 2021, 12:32 AM
llvm/include/llvm/Analysis/BlockFrequencyInfoImpl.h
1385

nit: it looks like this is just finding reachable/live blocks instead of hot blocks, hence the naming could be misleading.

1388

I think we could avoid having the index and extra map if the ProbMatrix (and other data structures) used block pointers instead of indices as block identifiers, and still removed cold blocks in the processing - i.e., replacing various vectors with map<BasicBlock*, ..>. I think that may be slightly more readable, but using indices as identifiers is closer to the underlying math. Either way is fine to me.

1489

Does self probability map to damping factor in original page rank?

llvm/lib/Analysis/BlockFrequencyInfoImpl.cpp
60

perhaps iterative-bfi-precision or something along those lines would be more reflective of what it does?

It'd be helpful to mention somewhere in the comment or description the trade-off between precision and run time (iterations needed to converge).

hoy added inline comments. Jun 7 2021, 9:12 AM
llvm/include/llvm/Analysis/BlockFrequencyInfoImpl.h
993

Nit: how about giving the type a name, like using ProbMatrixType = std::vector<std::vector<std::pair<size_t, Scaled64>>>; ?

1495

Wondering if it makes sense to not set I active. When I gets a noticeable update on its counts, its successors should be reprocessed and thus should be set active. But I'm not sure whether I itself should be reprocessed.

1595

Should the probability of parallel edges be accumulated?

llvm/test/Transforms/SampleProfile/profile-correlation-irreducible-loops.ll
3

The pseudo-probe pass is probably not needed since the test IR comes with pseudo probes.

spupyrev marked 3 inline comments as done. Jun 9 2021, 10:17 AM
spupyrev added inline comments.
llvm/include/llvm/Analysis/BlockFrequencyInfoImpl.h
1388

Sure, that's an option, but map access is costlier than using raw indices. Since the hottest loop of the implementation needs such an access, (I bet) the change would yield a perf loss.

1489

No, I don't think the damping factor is the same.

Self-edges are regular jumps in the CFG where the source and destination blocks coincide. (They are not super frequent, e.g., in SPEC, but do appear sometimes.) You can always replace a self-edge B1->B1 with two jumps B1->BX and BX->B1, where BX is a "dummy" block containing exactly one incoming and one outgoing jump. Then the inference problem on the new, modified CFG (which contains no self-edges) is equivalent to the original problem, as illustrated below. This transformation also shows that we cannot simply ignore self-edges, as the inference result might change.
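
A small illustration of that transformation (BX is a hypothetical block introduced only for this argument):

before:  B1 --p--> B1                      (self-edge with probability p)
after:   B1 --p--> BX --1.0--> B1          (BX has exactly one incoming and one outgoing jump)

Both problems yield the same frequency for B1, which is why the self-edge cannot simply be dropped.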

1495

This is a very good question, thanks!

I had exactly the same feeling and tried to modify this part as suggested. Unfortunately, it does result in (significantly) slower convergence in some instances, while not providing noticeable benefits. I don't have a rigorous explanation (the algorithm is a heuristic anyway), but here is my intuition:

We update frequencies of blocks in some order, which is dictated by ActiveSet (currently that's simply a queue). This order does affect the speed of convergence: for example, we want to prioritize updating the frequencies of blocks that are part of a hot loop. If at some iteration we modify the frequency of I, then there is a higher chance that block I will need to be updated again later. Thus, we explicitly add it to the queue so that it's updated again as soon as possible.

There are likely alternative strategies here, e.g., some priority-based queue and/or a smarter strategy for deciding when I needs to be updated. I played quite a bit with various versions but couldn't get significant wins over the default (simplest) strategy, so let's keep this question as future work.
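
As a rough sketch of the scheduling described above (the helpers computeUpdatedFrequency and isNoticeableChange are hypothetical placeholders, and Freq, Successors, NumHotBlocks are assumed to be in scope):

// Process blocks from a FIFO worklist; when a block's frequency changes
// noticeably, re-enqueue its successors and the block itself.
std::queue<size_t> ActiveSet;
std::vector<bool> IsActive(NumHotBlocks, true);
for (size_t I = 0; I < NumHotBlocks; I++)
  ActiveSet.push(I);
while (!ActiveSet.empty()) {
  size_t I = ActiveSet.front();
  ActiveSet.pop();
  IsActive[I] = false;
  Scaled64 NewFreq = computeUpdatedFrequency(I);   // hypothetical helper
  if (isNoticeableChange(Freq[I], NewFreq)) {      // hypothetical helper
    Freq[I] = NewFreq;
    for (size_t Succ : Successors[I]) {            // successors in the auxiliary graph
      if (!IsActive[Succ]) {
        IsActive[Succ] = true;
        ActiveSet.push(Succ);
      }
    }
    if (!IsActive[I]) {                            // re-activate I itself, as described above
      IsActive[I] = true;
      ActiveSet.push(I);
    }
  }
}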

1595

In my tests, I see that parallel edges always come with exactly the same probability, and their sum might exceed 1.0. I guess that's an assumption/invariant used in BPI.

spupyrev updated this revision to Diff 350941. Jun 9 2021, 10:22 AM

ProbMatrixType

hoy accepted this revision. Jun 10 2021, 9:47 AM

LGTM. Please wait to see if @davidxl and @wenlei have additional comments.

llvm/include/llvm/Analysis/BlockFrequencyInfoImpl.h
1495

Thanks for the explanation. Looks like the processing order matters here, but it's hard to track the exact order. Sounds good to keep the current implementation.

1595

You're right. getEdgeProbability returns the sum of all raw edge probabilities from Src to Dst.

/// Get the raw edge probability calculated for the block pair. This returns the
/// sum of all raw edge probabilities from Src to Dst.
BranchProbability
BranchProbabilityInfo::getEdgeProbability(const BasicBlock *Src,
                                          const BasicBlock *Dst) const {
  if (!Probs.count(std::make_pair(Src, 0)))
    return BranchProbability(llvm::count(successors(Src), Dst), succ_size(Src));

  auto Prob = BranchProbability::getZero();
  for (const_succ_iterator I = succ_begin(Src), E = succ_end(Src); I != E; ++I)
    if (*I == Dst)
      Prob += Probs.find(std::make_pair(Src, I.getSuccessorIndex()))->second;

  return Prob;
}
This revision is now accepted and ready to land. Jun 10 2021, 9:47 AM
davidxl accepted this revision. Jun 10 2021, 4:29 PM

lgtm

wenlei accepted this revision. Jun 11 2021, 9:30 PM

lgtm, thanks for working on this Sergey!

@davidxl @wmi We found this iterative BFI to work better compared to the irreducible loop header metadata approach. Curious to know if it would produce better results for your workloads too.

This revision was landed with ongoing or failed builds. Jun 11 2021, 9:52 PM
Closed by commit rG0a0800c4d10c: A post-processing for BFI inference (authored by spupyrev, committed by wenlei).
This revision was automatically updated to reflect the committed changes.

Will evaluate it. If the results are good, we can flip it on by default.