This is an archive of the discontinued LLVM Phabricator instance.

Differential D103289

A post-processing for BFI inference
ClosedPublic

Authored by spupyrev on May 27 2021, 3:36 PM.

Details

Reviewers

hoy
wenlei
wmi
davidxl

Commits

rG0a0800c4d10c: A post-processing for BFI inference

Summary

The current implementation for computing relative block frequencies does
not correctly handle control-flow graphs containing irreducible loops. This
results in suboptimally generated binaries, whose performance can be up to 5%
worse than optimal.

To resolve the problem, we apply a post-processing step, which iteratively
updates block frequencies based on the frequencies of their predecessors.
This corresponds to finding the stationary point of the Markov chain by
an iterative method, also known as "PageRank computation". The algorithm takes at
most O(|E| * IterativeBFIMaxIterations) steps but typically converges faster.
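
As a rough illustration of the iterative method named in the summary (a sketch, not the patch's implementation), the code below repeatedly applies Freq := Freq * P on a small Markov chain until no entry changes by more than a tolerance. The names `Edge`, `stationary`, `Eps`, and `MaxSweeps` are hypothetical, and `double` stands in for LLVM's Scaled64.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical edge record: an edge From -> I taken with probability Prob.
struct Edge {
  std::size_t From;
  double Prob;
};

// Incoming[I] lists the edges entering block I. Each sweep recomputes every
// frequency from the previous sweep's values; the loop stops once no entry
// moves by more than Eps or after MaxSweeps sweeps.
std::vector<double> stationary(const std::vector<std::vector<Edge>> &Incoming,
                               std::vector<double> Freq, double Eps = 1e-12,
                               std::size_t MaxSweeps = 10000) {
  assert(Incoming.size() == Freq.size());
  for (std::size_t Sweep = 0; Sweep < MaxSweeps; ++Sweep) {
    std::vector<double> Next(Freq.size(), 0.0);
    double MaxChange = 0.0;
    for (std::size_t I = 0; I < Freq.size(); ++I) {
      for (const Edge &E : Incoming[I])
        Next[I] += Freq[E.From] * E.Prob;
      MaxChange = std::max(MaxChange, std::fabs(Next[I] - Freq[I]));
    }
    Freq = Next;
    if (MaxChange <= Eps)
      break;
  }
  return Freq;
}
```

For a chain with edges 0 -> 1 (probability 1.0), 1 -> 1 (0.5) and 1 -> 0 (0.5), starting from {1.0, 0.0}, the frequencies approach {1/3, 2/3}.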

It is turned on by passing the option use-iterative-bfi-inference
and is applied only to functions containing profile data and irreducible loops.

Tested on SPEC06/17, where it helps to get correct profile counts for one of
the binaries (403.gcc). In prod binaries, we've seen a speedup of up to 2%-5%
for binaries containing functions with hot irreducible loops.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

spupyrev created this revision. May 27 2021, 3:36 PM

Herald added subscribers: wenlei, hiraditya. May 27 2021, 3:36 PM

spupyrev requested review of this revision. May 27 2021, 3:36 PM

Herald added a project: Restricted Project. May 27 2021, 3:36 PM

Herald added a subscriber: llvm-commits.

spupyrev edited the summary of this revision. May 27 2021, 3:41 PM

spupyrev added a reviewer: wenlei.

spupyrev edited the summary of this revision. May 27 2021, 3:44 PM

cc @wmi @xur @davidxl

hoy added reviewers: wmi, davidxl. May 27 2021, 3:46 PM

typo

wlei added a subscriber: wlei. May 27 2021, 3:51 PM

spupyrev edited the summary of this revision. May 27 2021, 3:52 PM

Harbormaster completed remote builds in B106617: Diff 348408. May 27 2021, 4:20 PM

thanks for working on this issue. A high level question -- is it possible to do the fix up on a per (irreducible) loop basis?

llvm/test/Transforms/SampleProfile/profile-correlation-irreducible-loops.ll
2	why -enable-new-pm = 0?
11	It will be helpful to draw a simple text art CFG to demonstrate the expected bb counts.

Adding ASCII representation of the test CFGs

In D103289#2786306, @davidxl wrote:

thanks for working on this issue. A high level question -- is it possible to do the fix up on a per (irreducible) loop basis?

Would you mind expanding on why you'd prefer a per-loop solution?

In general, we found that processing the entire control-flow graph (as opposed to identifying some "problematic" subgraphs first) is much easier from the implementation point of view, while it still keeps the algorithm fairly efficient. We have a notion of "active" blocks that are being updated, and the algorithm processes only such active vertices. Thus if the input counts are incorrect in a single loop, the algorithm will quickly learn that and will not touch the rest of the graph.
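
The "active blocks" idea can be sketched as a worklist: a block is revisited only when one of its predecessors changed by more than a tolerance. This is a simplified illustration under stated assumptions, not the code under review: `propagate`, `Matrix`, `Eps`, and `MaxUpdates` are hypothetical names, `double` replaces Scaled64, every block starts active, and the self-edge scaling and the re-queueing of the updated block itself (both discussed later in this thread) are left out.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <queue>
#include <utility>
#include <vector>

// Incoming[I] holds pairs (J, P): an edge J -> I taken with probability P.
using Matrix = std::vector<std::vector<std::pair<std::size_t, double>>>;

// Updates Freq in place and returns the number of block updates performed.
std::size_t propagate(const Matrix &Incoming, std::vector<double> &Freq,
                      double Eps, std::size_t MaxUpdates) {
  assert(Incoming.size() == Freq.size());
  // Successors[J] lists the blocks to revisit when Freq[J] changes.
  std::vector<std::vector<std::size_t>> Successors(Freq.size());
  for (std::size_t I = 0; I < Freq.size(); ++I)
    for (const auto &Jump : Incoming[I])
      Successors[Jump.first].push_back(I);

  // Start with every block queued once.
  std::vector<bool> IsActive(Freq.size(), true);
  std::queue<std::size_t> Active;
  for (std::size_t I = 0; I < Freq.size(); ++I)
    Active.push(I);

  std::size_t Updates = 0;
  while (Updates < MaxUpdates && !Active.empty()) {
    const std::size_t I = Active.front();
    Active.pop();
    IsActive[I] = false;
    ++Updates;

    double NewFreq = 0.0;
    for (const auto &Jump : Incoming[I])
      NewFreq += Freq[Jump.first] * Jump.second;

    // Only a noticeable change wakes up the successors of I.
    if (std::fabs(NewFreq - Freq[I]) > Eps) {
      for (std::size_t Succ : Successors[I]) {
        if (!IsActive[Succ]) {
          Active.push(Succ);
          IsActive[Succ] = true;
        }
      }
    }
    Freq[I] = NewFreq;
  }
  return Updates;
}
```

On an input that already satisfies Freq[I] = sum of Freq[J] * P over incoming edges, every block is popped once and never re-queued, so the work is a single pass over the edges.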

llvm/test/Transforms/SampleProfile/profile-correlation-irreducible-loops.ll

Without the option, I get

Cannot specify -analyze under new pass manager, either specify '-enable-new-pm=0', or use the corresponding new pass manager pass, e.g. '-passes=print<scalar-evolution>'. For a full list of passes, see the '--print-passes' flag.

Harbormaster completed remote builds in B107064: Diff 349009. Jun 1 2021, 10:52 AM

In D103289#2791449, @spupyrev wrote:

In D103289#2786306, @davidxl wrote:

thanks for working on this issue. A high level question -- is it possible to do the fix up on a per (irreducible) loop basis?

Would you mind expanding on why you'd prefer a per-loop solution?

Mainly to reduce compile time overhead, but you have explained that it is not an issue.

In general, we found that processing the entire control-flow graph (as opposed to identifying some "problematic" subgraphs first) is much easier from the implementation point of view, while it still keeps the algorithm fairly efficient. We have a notion of "active" blocks that are being updated, and the algorithm processes only such active vertices. Thus if the input counts are incorrect in a single loop, the algorithm will quickly learn that and will not touch the rest of the graph.

llvm/include/llvm/Analysis/BlockFrequencyInfoImpl.h
1388	why is this map needed (which adds a layer of indirection)?
1397	is it possible, given the blocks are hot?

spupyrev added inline comments. Jun 2 2021, 10:32 AM

llvm/include/llvm/Analysis/BlockFrequencyInfoImpl.h
1388	The map is used to index successors/predecessors of "hot" blocks, see line 1603. As an optimization, we don't process all the blocks in a function but only those that can be reached from the entry via branches with a positive probability. These are `HotBlocks` in the code. Typically, the number of HotBlocks is 2x-5x smaller than the total number of blocks in the function. In order to find an index of a block within the list, we either need to do a linear scan over HotBlocks, or have such an extra map.
1397	In theory, there is no guarantee that at least one of `getFloatingBlockFreq` is non-zero. (Notice that our "definition" of hot blocks does not rely on the result of the method). In practice, I've never seen this condition satisfied in our extensive evaluation. So let me change it to an assertion.
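
The trade-off described for line 1388 (one extra map versus a linear scan per lookup) can be illustrated as follows. This is a sketch only: `Block` and `buildIndex` are hypothetical names, and `std::unordered_map` stands in for the `DenseMap` used in the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a basic block.
struct Block {
  int Id;
};

// Maps each block pointer to its position in Reachable, so that a successor
// or predecessor can be turned into a dense index in expected constant time
// instead of scanning the vector.
std::unordered_map<const Block *, std::size_t>
buildIndex(const std::vector<const Block *> &Reachable) {
  std::unordered_map<const Block *, std::size_t> Index;
  Index.reserve(Reachable.size());
  for (std::size_t I = 0; I < Reachable.size(); ++I)
    Index[Reachable[I]] = I;
  assert(Index.size() == Reachable.size() && "duplicate block in the list");
  return Index;
}
```

Blocks that are not in the list have no entry in the map, which is also how the excluded blocks can be recognized afterwards.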

Adding a few comments and getting rid of unnecessary check

Harbormaster completed remote builds in B107297: Diff 349340. Jun 2 2021, 12:33 PM

davidxl added inline comments. Jun 3 2021, 9:43 AM

llvm/include/llvm/Analysis/BlockFrequencyInfoImpl.h
1439	can this overflow?
1440	why not using ScaledNumber::get(uint64_t) interface?
1443	why multiplying by Freq.size()? Should the option description reflect this?
1451	Can this loop be moved into computation of probMatrix and pass the succ vector in to avoid redundant computation.
1489	Does it apply to other backedges too?

comments

spupyrev marked an inline comment as done. Jun 3 2021, 5:58 PM

spupyrev added inline comments.

llvm/include/llvm/Analysis/BlockFrequencyInfoImpl.h
1440	here we convert double `EPS = 1e-12` to Scaled64, so need some magic. `ScaledNumber::get(uint64_t)` won't work for values < 1
1443	good point! I renamed the option and adjusted the description
1451	Here `Successors` represent successors of each vertex in the (auxiliary) graph. It is different from `Succs` object in the original CFG. (In particular the auxiliary graph contains jumps from all exit blocks to the entry) Also I find the current interface a bit cleaner: the main inference method, `iterativeInference`, takes the probability matrix as input and returns computed frequencies. `Successors` is an internal variable needed for computation.
1489	not sure I fully understand the question, but we need an adjustment only for self-edges; blocks without self-edges don't need any post-processing. I added a short comment before the loop.
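
For the conversion discussed at lines 1439-1440 above, the integer step and the overflow question can be sketched in isolation. The helper name `inverseOfPrecision` is hypothetical and is not LLVM API; it only shows the reciprocal that the patch then passes to `Scaled64::getInverse`, plus a range guard that the patch itself does not contain.

```cpp
#include <cassert>
#include <cstdint>

// Returns N such that 1/N approximates Eps (truncating toward zero).
std::uint64_t inverseOfPrecision(double Eps) {
  assert(0.0 < Eps && Eps < 1.0 && "precision must lie in (0, 1)");
  const double Inverse = 1.0 / Eps;
  // 2^64 as a double. Converting a value at or above it to uint64_t is
  // undefined behavior, so very small Eps values must be rejected.
  assert(Inverse < 18446744073709551616.0 && "1/Eps does not fit in 64 bits");
  return static_cast<std::uint64_t>(Inverse);
}
```

Because the result is truncated, a precision that is not an exact reciprocal of an integer is rounded to a slightly coarser value.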

Harbormaster completed remote builds in B107584: Diff 349733. Jun 3 2021, 6:20 PM

davidxl added inline comments. Jun 4 2021, 1:24 PM

llvm/include/llvm/Analysis/BlockFrequencyInfoImpl.h
1489	`NewFreq /= OneMinusSelfProb` looks like multiplying the block freq (one iteration loop) by the average trip count -- that is why I asked if this applies to other backedges.

spupyrev marked an inline comment as done. Jun 4 2021, 1:46 PM

spupyrev added inline comments.

llvm/include/llvm/Analysis/BlockFrequencyInfoImpl.h
1489	Here is the relevant math: we want to find a new frequency for block I, `Freq[I]`, such that it is equal to `\sum Freq[J] * Prob[J][I]`, where the sum is taken over all (incoming) jumps (J -> I). These are "ideal" frequencies that BFI is trying to compute. Clearly if the I-th block has no self-edges, then we simply assign `Freq[I]:=\sum Freq[J] * Prob[J][I]` (that is, no adjustment). However, if there are self-edges, we need to assign `Freq[I]:=(\sum Freq[J] * Prob[J][I]) / (1 - Prob[I][I])` (the adjustment in the code)
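
The adjustment can be derived in one step from the balance equation quoted above. The notation F_I for `Freq[I]` and P_{J,I} for `Prob[J][I]` is introduced here only for brevity; in the adjusted formula the sum runs over J different from I, which matches the code skipping the self-edge when accumulating `NewFreq`.

```latex
\begin{align*}
F_I &= \sum_{J} F_J \, P_{J,I}
     = F_I \, P_{I,I} + \sum_{J \neq I} F_J \, P_{J,I} \\
F_I \, (1 - P_{I,I}) &= \sum_{J \neq I} F_J \, P_{J,I} \\
F_I &= \frac{\sum_{J \neq I} F_J \, P_{J,I}}{1 - P_{I,I}}
\end{align*}
```

When block I has no self-edge, P_{I,I} = 0 and the last line reduces to the unadjusted assignment.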

davidxl added inline comments. Jun 4 2021, 2:16 PM

llvm/include/llvm/Analysis/BlockFrequencyInfoImpl.h
1489	I wonder why the special treatment is needed in the first place. Suppose we have

    BB1 (init freq = 50)
     |
     V <-----------------
    BB2 (init freq = 0) |
     /  \  90%          |
    / 10%\______________|
   <

With iterative fixup, BB2's frequency will converge to 500, which is the right value without any special handling.

spupyrev added inline comments. Jun 4 2021, 3:33 PM

llvm/include/llvm/Analysis/BlockFrequencyInfoImpl.h
1489	Excellent example! The correct inference here is `Freq[BB1] = 50, Freq[BB2] = 500`, which is found after 5 iterations using the diff. If we remove the self-edge adjustment, we don't get the right result: it converges to `Freq[BB1] = 50, Freq[BB2] = 50` after ~100 iterations. (Observe that we do modify the frequency of the entry block, it is not fixed) In general, I do not have a proof that the Markov chain always converges to the desired stationary point, if we incorrectly update frequencies (e.g., w/o the self-edge adjustment) -- I suspect it does not.
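
The values `Freq[BB1] = 50, Freq[BB2] = 500` can be checked against the self-edge formula with a few lines of code. This is a sketch under assumptions: `updatedFreq` and `Matrix` are hypothetical names, `double` replaces Scaled64, and the 10% exit from BB2 is modeled as an edge back to BB1.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Incoming[I] holds pairs (J, P): an edge J -> I taken with probability P.
using Matrix = std::vector<std::vector<std::pair<std::size_t, double>>>;

// New frequency of block I computed from its incoming edges. Self-edges are
// not summed; their total probability is divided out instead.
double updatedFreq(const Matrix &Incoming, const std::vector<double> &Freq,
                   std::size_t I) {
  double NewFreq = 0.0;
  double OneMinusSelfProb = 1.0;
  for (const auto &Jump : Incoming[I]) {
    if (Jump.first == I)
      OneMinusSelfProb -= Jump.second;
    else
      NewFreq += Freq[Jump.first] * Jump.second;
  }
  assert(OneMinusSelfProb > 0.0 && "block never leaves its self-loop");
  return NewFreq / OneMinusSelfProb;
}
```

With BB1 as index 0 and BB2 as index 1, updating BB2 from `{50, 0}` gives 50 / (1 - 0.9), approximately 500, and updating BB1 from `{50, 500}` gives 500 * 0.1, approximately 50, so the pair is a fixed point of the update up to rounding.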

davidxl added inline comments. Jun 4 2021, 4:06 PM

llvm/include/llvm/Analysis/BlockFrequencyInfoImpl.h
1489	By entry frequency, do you mean BB1's frequency? BB1 won't be active after the first iteration right?

spupyrev added inline comments. Jun 4 2021, 4:56 PM

llvm/include/llvm/Analysis/BlockFrequencyInfoImpl.h
1489	Yes, I meant BB1's frequency. Notice that in order to create a valid Markov chain, we need to add jumps from all exits to the entry. In this case, from BB2 to BB1. So BB1 will be active on later iterations.

davidxl added inline comments. Jun 5 2021, 11:34 AM

llvm/include/llvm/Analysis/BlockFrequencyInfoImpl.h
1489	Can you verify if it still works without the adjustment: in the small example, split BB2 into two BBs.

spupyrev added inline comments. Jun 5 2021, 12:11 PM

llvm/include/llvm/Analysis/BlockFrequencyInfoImpl.h
1489	I've commented above: If we remove the self-edge adjustment, we don't get the right result: it converges to Freq[BB1] = 50, Freq[BB2] = 50 after ~100 iterations. In general, I do not have a proof that the Markov chain always converges to the desired stationary point, if we incorrectly update frequencies (e.g., w/o the self-edge adjustment) -- I suspect it does not. What is the concern/question here? In my mind, this is not a "fix/hack" but the correct way of applying iterative inference.

davidxl added inline comments. Jun 5 2021, 12:23 PM

llvm/include/llvm/Analysis/BlockFrequencyInfoImpl.h
1489	There are not many concerns and the patch is almost good to go in. Just want to make sure the algo works for all cases.

Also thanks for the patience!

wenlei added inline comments. Jun 6 2021, 12:32 AM

llvm/include/llvm/Analysis/BlockFrequencyInfoImpl.h
1385	nit: it looks like this is just finding reachable/live blocks instead of hot blocks, hence the naming could be misleading.
1388	I think we could avoid having the index and extra map if the ProbMatrix (and other data structure) use block pointer instead of index as block identifier, and still remove cold blocks in the processing - i.e. replacing various vectors with map<BasicBlock*, ..>. I think that may be slightly more readable, but using index as identifier is closer to the underlying math.. Either way is fine to me.
1489	Does self probability map to damping factor in original page rank?
llvm/lib/Analysis/BlockFrequencyInfoImpl.cpp
60	perhaps `iterative-bfi-precision` or something alike is more reflective of what it does? It'd be helpful to mention somewhere in the comment or description the trade off between precision and run time (iterations needed to converge).

hoy added inline comments. Jun 7 2021, 9:12 AM

llvm/include/llvm/Analysis/BlockFrequencyInfoImpl.h
993	Nit: how about giving the type a name, like `using ProbMatrixType = std::vector<std::vector<std::pair<size_t, Scaled64>>>;` ?
1495	Wondering if it makes sense to not set `I` active. When `I` gets a noticeable update to its counts, its successors should be reprocessed, thus they should be set active. But not sure `I` itself should be reprocessed.
1595	Should the probability of parallel edges be accumulated?
llvm/test/Transforms/SampleProfile/profile-correlation-irreducible-loops.ll
3	The pseudo-probe pass is probably not needed since the test IR comes with pseudo probes.

comments

spupyrev marked 3 inline comments as done. Jun 9 2021, 10:17 AM

spupyrev added inline comments.

llvm/include/llvm/Analysis/BlockFrequencyInfoImpl.h
1388	Sure, that's an option, but map access is costlier than using raw indices. Since the hottest loop of the implementation needs such an access, (I bet) the change will yield a performance loss.
1489	No I don't think damping factor is the same. Self-edges are regular jumps in CFG where the source and the destination blocks coincide. (they are not super frequent e.g., in SPEC, but do appear sometimes). You can always replace a self-edge from block B1->B1 with two jumps B1->BX and BX->B1, where BX is a "dummy" block containing exactly one incoming and one outgoing jump. Then the inference problem on the new modified CFG (that contains no self-edges) is equivalent to the original problem. This transformation also shows that we cannot simply ignore self-edges, as the inference result might change.
1495	This is a very good question, thanks! I had exactly the same feeling and tried to modify this part as suggested. Unfortunately, it does result in (significantly) slower convergence in some instances, while not providing noticeable benefits. I don't have a rigorous explanation (the algorithm is a heuristic anyway), but here is my intuition: we update frequencies of blocks in some order, which is dictated by `ActiveSet` (currently that's simply a queue). This order does affect the speed of convergence: for example, we want to prioritize updates of frequencies of blocks that are part of a hot loop. If at an iteration we modify the frequency of `I`, then there is a higher chance that block `I` will need to be updated later. Thus, we explicitly add it to the queue so that it's updated again as soon as possible. There are likely alternative strategies here, e.g., having some priority-based queues and/or a smarter strategy for deciding when `I` needs to be updated. I played quite a bit with various versions but couldn't get significant wins over the default (simplest) strategy. So let's keep this question as future work.
1595	In my tests, I see parallel edges are always coming with exactly the same probability, and their sum might exceed 1.0. I guess that's an assumption/invariant used in BPI.
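
The self-edge replacement described in the reply for line 1489 above (B1->B1 becomes B1->BX and BX->B1) can be sketched as a small graph transformation. `splitSelfEdges` and `Matrix` are hypothetical names used only for this illustration, with `double` probabilities in place of Scaled64.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Incoming[I] holds pairs (J, P): an edge J -> I taken with probability P.
using Matrix = std::vector<std::vector<std::pair<std::size_t, double>>>;

// Returns a copy of the graph in which every self-edge I -> I with
// probability P is replaced by a fresh dummy block X with the edges
// I -> X (probability P) and X -> I (probability 1).
Matrix splitSelfEdges(const Matrix &Incoming) {
  Matrix Result(Incoming.size());
  for (std::size_t I = 0; I < Incoming.size(); ++I) {
    for (const auto &Jump : Incoming[I]) {
      assert(Jump.first < Incoming.size() && "edge from an unknown block");
      if (Jump.first != I) {
        Result[I].push_back(Jump);
        continue;
      }
      const std::size_t Dummy = Result.size();
      std::vector<std::pair<std::size_t, double>> DummyIn;
      DummyIn.emplace_back(I, Jump.second); // I -> Dummy, old self-probability
      Result.push_back(DummyIn);
      Result[I].emplace_back(Dummy, 1.0); // Dummy -> I, always taken
    }
  }
  return Result;
}
```

Applied to the two-block example from earlier in the thread (with the 10% exit modeled as an edge back to BB1), the transformed graph has three blocks, and the frequencies {50, 500, 450} satisfy the plain balance equations with no division: BB2 receives 50 from BB1 plus 450 from the dummy block.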

ProbMatrixType

Harbormaster completed remote builds in B108449: Diff 350941. Jun 9 2021, 12:06 PM

LGTM. Please wait to see if @davidxl and @wenlei have additional comments.

llvm/include/llvm/Analysis/BlockFrequencyInfoImpl.h

1495

Thanks for the explanation. Looks like the processing order matters here but hard to track the exact order. Sounds good to keep the current implementation.

1595

You're right. getEdgeProbability returns the sum of all raw edge probabilities from Src to Dst.

/// Get the raw edge probability calculated for the block pair. **This returns the
/// sum of all raw edge probabilities from Src to Dst.**
BranchProbability
BranchProbabilityInfo::getEdgeProbability(const BasicBlock *Src,
                                          const BasicBlock *Dst) const {
  if (!Probs.count(std::make_pair(Src, 0)))
    return BranchProbability(llvm::count(successors(Src), Dst), succ_size(Src));

  auto Prob = BranchProbability::getZero();
  for (const_succ_iterator I = succ_begin(Src), E = succ_end(Src); I != E; ++I)
    if (*I == Dst)
      Prob += Probs.find(std::make_pair(Src, I.getSuccessorIndex()))->second;

  return Prob;
}

This revision is now accepted and ready to land. Jun 10 2021, 9:47 AM

lgtm

lgtm, thanks for working on this Sergey!

@davidxl @wmi We found this iterative BFI to work better compared to the irreducible loop header metadata approach. Curious to know if it would produce better results for your workload too.

This revision was landed with ongoing or failed builds. Jun 11 2021, 9:52 PM

Closed by commit rG0a0800c4d10c: A post-processing for BFI inference (authored by spupyrev, committed by wenlei).

This revision was automatically updated to reflect the committed changes.

wenlei added a commit: rG0a0800c4d10c: A post-processing for BFI inference.

Will evaluate it. If the results are good, we can flip it on by default.

Revision Contents

Path	Size
llvm/include/llvm/Analysis/BlockFrequencyInfoImpl.h	337 lines
llvm/lib/Analysis/BlockFrequencyInfoImpl.cpp	14 lines
llvm/test/Transforms/SampleProfile/Inputs/profile-correlation-irreducible-loops.prof	19 lines
llvm/test/Transforms/SampleProfile/profile-correlation-irreducible-loops.ll	187 lines

Diff 351630

llvm/include/llvm/Analysis/BlockFrequencyInfoImpl.h

[... 36 lines not shown ...]
 #include <algorithm>
 #include <cassert>
 #include <cstddef>
 #include <cstdint>
 #include <deque>
 #include <iterator>
 #include <limits>
 #include <list>
+#include <queue>
 #include <string>
+#include <unordered_set>
 #include <utility>
 #include <vector>

 #define DEBUG_TYPE "block-freq"

 namespace llvm {
 extern llvm::cl::opt<bool> CheckBFIUnknownBlockQueries;

+extern llvm::cl::opt<bool> UseIterativeBFIInference;
+extern llvm::cl::opt<unsigned> IterativeBFIMaxIterationsPerBlock;
+extern llvm::cl::opt<double> IterativeBFIPrecision;

 class BranchProbabilityInfo;
 class Function;
 class Loop;
 class LoopInfo;
 class MachineBasicBlock;
 class MachineBranchProbabilityInfo;
 class MachineFunction;
 class MachineLoop;
[... 900 lines not shown; inside: template <class BT> class BlockFrequencyInfoImpl : BlockFrequencyInfoImplBase { ...]
   ///
   /// \post \a tryToComputeMassInFunction() has returned \c true.
   void computeMassInFunction();

   std::string getBlockName(const BlockNode &Node) const override {
     return bfi_detail::getBlockName(getBlock(Node));
   }

+  /// The current implementation for computing relative block frequencies does
+  /// not correctly handle control-flow graphs containing irreducible loops. To
+  /// resolve the problem, we apply a post-processing step, which iteratively
+  /// updates block frequencies based on the frequencies of their predecessors.
+  /// This corresponds to finding the stationary point of the Markov chain by
+  /// an iterative method aka "PageRank computation".
+  /// The algorithm takes at most O(|E| * IterativeBFIMaxIterations) steps but
+  /// typically converges faster.
+  ///
+  /// Decide whether we want to apply iterative inference for a given function.
+  bool needIterativeInference() const;

+  /// Apply an iterative post-processing to infer correct counts for irr loops.
+  void applyIterativeInference();

+  using ProbMatrixType = std::vector<std::vector<std::pair<size_t, Scaled64>>>;

+  /// Run iterative inference for a probability matrix and initial frequencies.
+  void iterativeInference(const ProbMatrixType &ProbMatrix,
+                          std::vector<Scaled64> &Freq) const;

+  /// Find all blocks to apply inference on, that is, reachable from the entry
+  /// and backward reachable from exits along edges with positive probability.
+  void findReachableBlocks(std::vector<const BlockT *> &Blocks) const;

+  /// Build a matrix of probabilities with transitions (edges) between the
+  /// blocks: ProbMatrix[I] holds pairs (J, P), where Pr[J -> I | J] = P
+  void initTransitionProbabilities(
+      const std::vector<const BlockT *> &Blocks,
+      const DenseMap<const BlockT *, size_t> &BlockIndex,
+      ProbMatrixType &ProbMatrix) const;

+#ifndef NDEBUG
+  /// Compute the discrepancy between current block frequencies and the
+  /// probability matrix.
+  Scaled64 discrepancy(const ProbMatrixType &ProbMatrix,
+                       const std::vector<Scaled64> &Freq) const;
+#endif

 public:
   BlockFrequencyInfoImpl() = default;

   const FunctionT *getFunction() const { return F; }

   void calculate(const FunctionT &F, const BranchProbabilityInfoT &BPI,
                  const LoopInfoT &LI);
[... 110 lines not shown; inside: void BlockFrequencyInfoImpl<BT>::calculate(const FunctionT &F, ...]
   initializeRPOT();
   initializeLoops();

   // Visit loops in post-order to find the local mass distribution, and then do
   // the full function.
   computeMassInLoops();
   computeMassInFunction();
   unwrapLoops();
+  // Apply a post-processing step improving computed frequencies for functions
+  // with irreducible loops.
+  if (needIterativeInference())
+    applyIterativeInference();
   finalizeMetrics();

   if (CheckBFIUnknownBlockQueries) {
     // To detect BFI queries for unknown blocks, add entries for unreachable
     // blocks, if any. This is to distinguish between known/existing unreachable
     // blocks and unknown blocks.
     for (const BlockT &BB : F)
       if (!Nodes.count(&BB))
[... 204 lines not shown; inside: template <class BT> void BlockFrequencyInfoImpl<BT>::computeMassInFunction() { ...]
   if (tryToComputeMassInFunction())
     return;
   computeIrreducibleMass(nullptr, Loops.begin());
   if (tryToComputeMassInFunction())
     return;
   llvm_unreachable("unhandled irreducible control flow");
 }

+template <class BT>
+bool BlockFrequencyInfoImpl<BT>::needIterativeInference() const {
+  if (!UseIterativeBFIInference)
+    return false;
+  if (!F->getFunction().hasProfileData())
+    return false;
+  // Apply iterative inference only if the function contains irreducible loops;
+  // otherwise, computed block frequencies are reasonably correct.
+  for (auto L = Loops.rbegin(), E = Loops.rend(); L != E; ++L) {
+    if (L->isIrreducible())
+      return true;
+  }
+  return false;
+}

+template <class BT> void BlockFrequencyInfoImpl<BT>::applyIterativeInference() {
+  // Extract blocks for processing: a block is considered for inference iff it
+  // can be reached from the entry by edges with a positive probability.
+  // Non-processed blocks are assigned with the zero frequency and are ignored
+  // in the computation
+  std::vector<const BlockT *> ReachableBlocks;
+  findReachableBlocks(ReachableBlocks);
+  if (ReachableBlocks.empty())
+    return;

+  // The map is used to index successors/predecessors of reachable blocks in
+  // the ReachableBlocks vector
+  DenseMap<const BlockT *, size_t> BlockIndex;
+  // Extract initial frequencies for the reachable blocks
+  auto Freq = std::vector<Scaled64>(ReachableBlocks.size());
+  Scaled64 SumFreq;
+  for (size_t I = 0; I < ReachableBlocks.size(); I++) {
+    const BlockT *BB = ReachableBlocks[I];
+    BlockIndex[BB] = I;
+    Freq[I] = getFloatingBlockFreq(BB);
+    SumFreq += Freq[I];
+  }
+  assert(!SumFreq.isZero() && "empty initial block frequencies");

+  LLVM_DEBUG(dbgs() << "Applying iterative inference for " << F->getName()
+                    << " with " << ReachableBlocks.size() << " blocks\n");

+  // Normalizing frequencies so they sum up to 1.0
+  for (auto &Value : Freq) {
+    Value /= SumFreq;
+  }

+  // Setting up edge probabilities using sparse matrix representation:
+  // ProbMatrix[I] holds a vector of pairs (J, P) where Pr[J -> I | J] = P
+  ProbMatrixType ProbMatrix;
+  initTransitionProbabilities(ReachableBlocks, BlockIndex, ProbMatrix);

+  // Run the propagation
+  iterativeInference(ProbMatrix, Freq);

+  // Assign computed frequency values
+  for (const BlockT &BB : *F) {
+    auto Node = getNode(&BB);
+    if (!Node.isValid())
+      continue;
+    if (BlockIndex.count(&BB)) {
+      Freqs[Node.Index].Scaled = Freq[BlockIndex[&BB]];
+    } else {
+      Freqs[Node.Index].Scaled = Scaled64::getZero();
+    }
+  }
+}

		template <class BT>
		void BlockFrequencyInfoImpl<BT>::iterativeInference(
		const ProbMatrixType &ProbMatrix, std::vector<Scaled64> &Freq) const {
		assert(0.0 < IterativeBFIPrecision && IterativeBFIPrecision < 1.0 &&
		"incorrectly specified precision");
		// Convert double precision to Scaled64
		const auto Precision =
		davidxlUnsubmitted Not Done Reply Inline Actions can this overflow? davidxl: can this overflow?
		Scaled64::getInverse(static_cast<uint64_t>(1.0 / IterativeBFIPrecision));
		davidxlUnsubmitted Not Done Reply Inline Actions why not using ScaledNumber::get(uint64_t) interface? davidxl: why not using ScaledNumber::get(uint64_t) interface?
		spupyrevAuthorUnsubmitted Done Reply Inline Actions here we convert double `EPS = 1e-12` to Scaled64, so need some magic. `ScaledNumber::get(uint64_t)` won't work for values < 1 spupyrev: here we convert double `EPS = 1e-12` to Scaled64, so need some magic. `ScaledNumber::get…
		const size_t MaxIterations = IterativeBFIMaxIterationsPerBlock * Freq.size();

		#ifndef NDEBUG
		davidxlUnsubmitted Done Reply Inline Actions why multiplying by Freq.size()? Should the option description reflect this? davidxl: why multiplying by Freq.size()? Should the option description reflect this?
		spupyrevAuthorUnsubmitted Done Reply Inline Actions good point! I renamed the option and adjusted the description spupyrev: good point! I renamed the option and adjusted the description
		LLVM_DEBUG(dbgs() << " Initial discrepancy = "
		<< discrepancy(ProbMatrix, Freq).toString() << "\n");
		#endif

		// Successors[I] holds unique sucessors of the I-th block
		auto Successors = std::vector<std::vector<size_t>>(Freq.size());
		for (size_t I = 0; I < Freq.size(); I++) {
		for (auto &Jump : ProbMatrix[I]) {
		davidxlUnsubmitted Not Done Reply Inline Actions Can this loop be moved into computation of probMatrix and pass the succ vector in to avoid redundant computation. davidxl: Can this loop be moved into computation of probMatrix and pass the succ vector in to avoid…
		spupyrevAuthorUnsubmitted Done Reply Inline Actions Here `Successors` represent successors of each vertex in the (auxiliary) graph. It is different from `Succs` object in the original CFG. (In particular the auxiliary graph contains jumps from all exit block to the entry) Also I find the current interface a bit cleaner: the main inference method, `iterativeInference`, takes the probability matrix as input and returns computed frequencies. `Successors` is an internal variable needed for computation. spupyrev: Here `Successors` represent successors of each vertex in the (auxiliary) graph. It is different…
		Successors[Jump.first].push_back(I);
		}
		}

		// To speedup computation, we maintain a set of "active" blocks whose
		// frequencies need to be updated based on the incoming edges.
		// The set is dynamic and changes after every update. Initially all blocks
		// with a positive frequency are active
		auto IsActive = std::vector<bool>(Freq.size(), false);
		std::queue<size_t> ActiveSet;
		for (size_t I = 0; I < Freq.size(); I++) {
		if (Freq[I] > 0) {
		ActiveSet.push(I);
		IsActive[I] = true;
		}
		}

		// Iterate over the blocks propagating frequencies
		size_t It = 0;
		while (It++ < MaxIterations && !ActiveSet.empty()) {
		size_t I = ActiveSet.front();
		ActiveSet.pop();
		IsActive[I] = false;

		// Compute a new frequency for the block: NewFreq := Freq \times ProbMatrix.
		// Special care is taken for self-edges: the sum over the other incoming
		// edges is divided by (1.0 - SelfProb), where SelfProb is the sum of
		// probabilities on the block's self-edges
		Scaled64 NewFreq;
		Scaled64 OneMinusSelfProb = Scaled64::getOne();
		for (auto &Jump : ProbMatrix[I]) {
		if (Jump.first == I) {
		OneMinusSelfProb -= Jump.second;
		} else {
		NewFreq += Freq[Jump.first] * Jump.second;
		}
		}
		if (OneMinusSelfProb != Scaled64::getOne())
		NewFreq /= OneMinusSelfProb;
		davidxl: Does it apply to other backedges too?
		spupyrev (author): Not sure I fully understand the question, but we need an adjustment only for self-edges; blocks without self-edges don't need any post-processing. I added a short comment before the loop.
		davidxl: `NewFreq /= OneMinusSelfProb` looks like multiplying the block freq (one iteration of the loop) by the average trip count; that is why I asked if this applies to other backedges.
		spupyrev (author): Here is the relevant math: we want to find a new frequency for block I, `Freq[I]`, such that it is equal to `\sum Freq[J] * Prob[J][I]`, where the sum is taken over all (incoming) jumps (J -> I). These are the "ideal" frequencies that BFI is trying to compute. Clearly, if the I-th block has no self-edges, then we simply assign `Freq[I]:=\sum Freq[J] * Prob[J][I]` (that is, no adjustment). However, if there are self-edges, we need to assign `Freq[I]:=(\sum Freq[J] * Prob[J][I]) / (1 - Prob[I][I])` (the adjustment in the code).
		davidxl: I wonder why the special treatment is needed in the first place. Suppose we have
		    BB1 (init freq = 50)
		      |
		      V  <-----------------
		    BB2 (init freq = 0)    |
		      / \  90%             |
		     / 10%\________________|
		    <
		With iterative fixup, BB2's frequency will converge to 500, which is the right value without any special handling.
		spupyrev (author): Excellent example! The correct inference here is `Freq[BB1] = 50, Freq[BB2] = 500`, which is found after 5 iterations using the diff. If we remove the self-edge adjustment, we don't get the right result: it converges to `Freq[BB1] = 50, Freq[BB2] = 50` after ~100 iterations. (Observe that we do modify the frequency of the entry block; it is not fixed.) In general, I do not have a proof that the Markov chain always converges to the desired stationary point if we incorrectly update frequencies (e.g., w/o the self-edge adjustment); I suspect it does not.
		davidxl: By entry frequency, do you mean BB1's frequency? BB1 won't be active after the first iteration, right?
		spupyrev (author): Yes, I meant BB1's frequency. Notice that in order to create a valid Markov chain, we need to add jumps from all exits to the entry; in this case, from BB2 to BB1. So BB1 will be active on later iterations.
		davidxl: Can you verify if it still works without the adjustment: in the small example, split BB2 into two BBs.
		spupyrev (author): I've commented above: "If we remove the self-edge adjustment, we don't get the right result: it converges to Freq[BB1] = 50, Freq[BB2] = 50 after ~100 iterations. In general, I do not have a proof that the Markov chain always converges to the desired stationary point if we incorrectly update frequencies (e.g., w/o the self-edge adjustment); I suspect it does not." What is the concern/question here? In my mind, this is not a "fix/hack" but the correct way of applying iterative inference.
		davidxl: There are not many concerns and the patch is almost good to go in. I just want to make sure the algo works for all cases.
		wenlei: Does the self probability map to the damping factor in the original PageRank?
		spupyrev (author): No, I don't think the damping factor is the same. Self-edges are regular jumps in the CFG where the source and the destination blocks coincide (they are not super frequent, e.g., in SPEC, but do appear sometimes). You can always replace a self-edge B1->B1 with two jumps B1->BX and BX->B1, where BX is a "dummy" block containing exactly one incoming and one outgoing jump. Then the inference problem on the modified CFG (which contains no self-edges) is equivalent to the original problem. This transformation also shows that we cannot simply ignore self-edges, as the inference result might change.
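The self-edge adjustment described in the thread above can be written out as a short derivation. The symbols f_I and P_{JI} are shorthand introduced here for `Freq[I]` and `Prob[J][I]`; the derivation assumes the self-probability P_{II} is strictly less than 1. The last line plugs in the numbers of the two-block example from the thread (jump BB1 -> BB2 with probability 1, self-edge on BB2 with probability 0.9).

```latex
\begin{align*}
f_I &= \sum_{J \to I} f_J \, P_{JI}
     = \sum_{J \ne I} f_J \, P_{JI} + f_I \, P_{II} \\
f_I \, (1 - P_{II}) &= \sum_{J \ne I} f_J \, P_{JI} \\
f_I &= \frac{\sum_{J \ne I} f_J \, P_{JI}}{1 - P_{II}} \\[4pt]
f_{BB2} &= \frac{f_{BB1} \cdot 1}{1 - 0.9} = \frac{50}{0.1} = 500
\end{align*}
```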

		// If the block's frequency has changed enough, then
		// make sure the block and its successors are in the active set
		auto Change = Freq[I] >= NewFreq ? Freq[I] - NewFreq : NewFreq - Freq[I];
		if (Change > Precision) {
		ActiveSet.push(I);
		hoy: Wondering if it makes sense to not set `I` active. When `I` gets a noticeable update on its counts, its successors should be reprocessed, thus they should be set active. But I'm not sure `I` itself should be reprocessed.
		spupyrev (author): This is a very good question, thanks! I had exactly the same feeling and tried to modify this part as suggested. Unfortunately, it does result in (significantly) slower convergence in some instances, while not providing noticeable benefits. I don't have a rigorous explanation (the alg is a heuristic anyway), but here is my intuition: we update frequencies of blocks in some order, which is dictated by `ActiveSet` (currently that's simply a queue). This order does affect the speed of convergence: for example, we want to prioritize updates of frequencies of blocks that are part of a hot loop. If at an iteration we modify the frequency of `I`, then there is a higher chance that block `I` will need to be updated later. Thus, we explicitly add it to the queue so that it's updated again as soon as possible. There are likely alternative strategies here, e.g., having some priority-based queues and/or a smarter strategy for deciding when `I` needs to be updated. I played quite a bit with various versions but couldn't get significant wins over the default (simplest) strategy. So let's keep this question as future work.
		hoy: Thanks for the explanation. Looks like the processing order matters here but it is hard to track the exact order. Sounds good to keep the current implementation.
		IsActive[I] = true;
		for (size_t Succ : Successors[I]) {
		if (!IsActive[Succ]) {
		ActiveSet.push(Succ);
		IsActive[Succ] = true;
		}
		}
		}

		// Update the frequency for the block
		Freq[I] = NewFreq;
		}

		LLVM_DEBUG(dbgs() << " Completed " << It << " inference iterations"
		<< format(" (%0.0f per block)", double(It) / Freq.size())
		<< "\n");
		#ifndef NDEBUG
		LLVM_DEBUG(dbgs() << " Final discrepancy = "
		<< discrepancy(ProbMatrix, Freq).toString() << "\n");
		#endif
		}
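The active-set loop above can be tried outside of LLVM with a small standalone sketch. This is not the code from the diff: it assumes plain `double` in place of `Scaled64`, and the names `IncomingMatrix` and `inferFrequencies` are hypothetical. As in the diff, row `I` of the matrix holds `(source, probability)` pairs for the jumps entering block `I`.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical stand-in for ProbMatrixType: Incoming[I] holds one
// (source, probability) pair per jump source -> I.
using IncomingMatrix = std::vector<std::vector<std::pair<std::size_t, double>>>;

std::vector<double> inferFrequencies(const IncomingMatrix &Incoming,
                                     std::vector<double> Freq,
                                     double Precision = 1e-12,
                                     std::size_t MaxIterations = 100000) {
  const std::size_t N = Freq.size();
  assert(Incoming.size() == N && "one matrix row per block expected");

  // Successors[J] lists every block I that has an incoming jump J -> I.
  std::vector<std::vector<std::size_t>> Successors(N);
  for (std::size_t I = 0; I < N; ++I)
    for (const auto &Jump : Incoming[I])
      Successors[Jump.first].push_back(I);

  // Initially, every block with a positive frequency is active.
  std::vector<bool> IsActive(N, false);
  std::queue<std::size_t> ActiveSet;
  for (std::size_t I = 0; I < N; ++I) {
    if (Freq[I] > 0) {
      ActiveSet.push(I);
      IsActive[I] = true;
    }
  }

  std::size_t It = 0;
  while (It++ < MaxIterations && !ActiveSet.empty()) {
    const std::size_t I = ActiveSet.front();
    ActiveSet.pop();
    IsActive[I] = false;

    // Sum the contributions of all non-self incoming jumps and collect the
    // total self-edge probability separately.
    double NewFreq = 0.0;
    double OneMinusSelfProb = 1.0;
    for (const auto &Jump : Incoming[I]) {
      if (Jump.first == I)
        OneMinusSelfProb -= Jump.second;
      else
        NewFreq += Freq[Jump.first] * Jump.second;
    }
    // Self-edge adjustment; this sketch assumes the self-probability is < 1.
    assert(OneMinusSelfProb > 0.0 && "block never leaves its self-loop");
    NewFreq /= OneMinusSelfProb;

    // Re-activate the block and its successors if the value moved enough.
    if (std::fabs(Freq[I] - NewFreq) > Precision) {
      ActiveSet.push(I);
      IsActive[I] = true;
      for (std::size_t Succ : Successors[I]) {
        if (!IsActive[Succ]) {
          ActiveSet.push(Succ);
          IsActive[Succ] = true;
        }
      }
    }
    Freq[I] = NewFreq;
  }
  return Freq;
}
```

For the two-block chain from the review thread (BB1 -> BB2 with probability 1, a self-edge on BB2 with probability 0.9, and an auxiliary jump BB2 -> BB1 with probability 0.1), calling this sketch with both starting frequencies set to 50 (an assumption made here, not the initialization used by BFI) should return frequencies in a ratio of roughly 1:10. Only the ratio is meaningful; the absolute scale depends on the starting values.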

		template <class BT>
		void BlockFrequencyInfoImpl<BT>::findReachableBlocks(
		std::vector<const BlockT *> &Blocks) const {
		// Find all blocks to apply inference on, that is, reachable from the entry
		// along edges with non-zero probabilities
		std::queue<const BlockT *> Queue;
		std::unordered_set<const BlockT *> Reachable;
		const BlockT *Entry = &F->front();
		Queue.push(Entry);
		Reachable.insert(Entry);
		while (!Queue.empty()) {
		const BlockT *SrcBB = Queue.front();
		Queue.pop();
		for (const BlockT *DstBB : children<const BlockT *>(SrcBB)) {
		auto EP = BPI->getEdgeProbability(SrcBB, DstBB);
		if (EP.isZero())
		continue;
		if (Reachable.find(DstBB) == Reachable.end()) {
		Queue.push(DstBB);
		Reachable.insert(DstBB);
		}
		}
		}

		// Find all blocks to apply inference on, that is, backward reachable from
		// the (reachable) exit blocks along edges with non-zero probabilities
		std::unordered_set<const BlockT *> InverseReachable;
		for (const BlockT &BB : *F) {
		// An exit block is a block without any successors
		bool HasSucc = GraphTraits<const BlockT *>::child_begin(&BB) !=
		GraphTraits<const BlockT *>::child_end(&BB);
		if (!HasSucc && Reachable.count(&BB)) {
		Queue.push(&BB);
		InverseReachable.insert(&BB);
		}
		}
		while (!Queue.empty()) {
		const BlockT *SrcBB = Queue.front();
		Queue.pop();
		for (const BlockT *DstBB : children<Inverse<const BlockT *>>(SrcBB)) {
		auto EP = BPI->getEdgeProbability(DstBB, SrcBB);
		if (EP.isZero())
		continue;
		if (InverseReachable.find(DstBB) == InverseReachable.end()) {
		Queue.push(DstBB);
		InverseReachable.insert(DstBB);
		}
		}
		}

		// Collect the result
		Blocks.reserve(F->size());
		for (const BlockT &BB : *F) {
		if (Reachable.count(&BB) && InverseReachable.count(&BB)) {
		Blocks.push_back(&BB);
		}
		}
		}

		template <class BT>
		void BlockFrequencyInfoImpl<BT>::initTransitionProbabilities(
		const std::vector<const BlockT *> &Blocks,
		const DenseMap<const BlockT *, size_t> &BlockIndex,
		ProbMatrixType &ProbMatrix) const {
		const size_t NumBlocks = Blocks.size();
		auto Succs = std::vector<std::vector<std::pair<size_t, Scaled64>>>(NumBlocks);
		auto SumProb = std::vector<Scaled64>(NumBlocks);

		// Find unique successors and corresponding probabilities for every block
		for (size_t Src = 0; Src < NumBlocks; Src++) {
		const BlockT *BB = Blocks[Src];
		std::unordered_set<const BlockT *> UniqueSuccs;
		for (const auto SI : children<const BlockT *>(BB)) {
		// Ignore cold blocks
		if (BlockIndex.find(SI) == BlockIndex.end())
		continue;
		// Ignore parallel edges between BB and SI blocks
		if (UniqueSuccs.find(SI) != UniqueSuccs.end())
		hoy: Should the probability of parallel edges be accumulated?
		spupyrev (author): In my tests, I see that parallel edges always come with exactly the same probability, and their sum might exceed 1.0. I guess that's an assumption/invariant used in BPI.
		hoy: You're right. `getEdgeProbability` returns the sum of all raw edge probabilities from Src to Dst.
		    /// Get the raw edge probability calculated for the block pair. This returns the
		    /// sum of all raw edge probabilities from Src to Dst.
		    BranchProbability
		    BranchProbabilityInfo::getEdgeProbability(const BasicBlock *Src,
		                                              const BasicBlock *Dst) const {
		      if (!Probs.count(std::make_pair(Src, 0)))
		        return BranchProbability(llvm::count(successors(Src), Dst), succ_size(Src));
		      auto Prob = BranchProbability::getZero();
		      for (const_succ_iterator I = succ_begin(Src), E = succ_end(Src); I != E; ++I)
		        if (*I == Dst)
		          Prob += Probs.find(std::make_pair(Src, I.getSuccessorIndex()))->second;
		      return Prob;
		    }
		continue;
		UniqueSuccs.insert(SI);
		// Ignore jumps with zero probability
		auto EP = BPI->getEdgeProbability(BB, SI);
		if (EP.isZero())
		continue;

		auto EdgeProb =
		Scaled64::getFraction(EP.getNumerator(), EP.getDenominator());
		size_t Dst = BlockIndex.find(SI)->second;
		Succs[Src].push_back(std::make_pair(Dst, EdgeProb));
		SumProb[Src] += EdgeProb;
		}
		}

		// Add transitions for every jump with positive branch probability
		ProbMatrix = ProbMatrixType(NumBlocks);
		for (size_t Src = 0; Src < NumBlocks; Src++) {
		// Ignore blocks w/o successors
		if (Succs[Src].empty())
		continue;

		assert(!SumProb[Src].isZero() && "Zero sum probability of non-exit block");
		for (auto &Jump : Succs[Src]) {
		size_t Dst = Jump.first;
		Scaled64 Prob = Jump.second;
		ProbMatrix[Dst].push_back(std::make_pair(Src, Prob / SumProb[Src]));
		}
		}

		// Add transitions from sinks to the source
		size_t EntryIdx = BlockIndex.find(&F->front())->second;
		for (size_t Src = 0; Src < NumBlocks; Src++) {
		if (Succs[Src].empty()) {
		ProbMatrix[EntryIdx].push_back(std::make_pair(Src, Scaled64::getOne()));
		}
		}
		}
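The construction of the transition matrix can be illustrated with a standalone sketch under simplifying assumptions: block indices instead of `BlockT` pointers, `double` instead of `Scaled64`, and the hypothetical names `Edge`, `EdgeLists` and `buildIncomingMatrix`. It shows the two steps of the function above: renormalizing the kept outgoing probabilities of each block so that they sum to one, and adding a probability-1 transition from every sink to the entry. Unlike the diff, which appends the sink transitions in a separate final loop, the sketch handles both cases in one pass, so the order of entries within a row may differ.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// (block index, probability); both aliases are hypothetical helpers.
using Edge = std::pair<std::size_t, double>;
using EdgeLists = std::vector<std::vector<Edge>>;

// Succs[Src] lists (Dst, raw probability) for the jumps kept out of Src.
// The result is indexed by destination: Incoming[Dst] lists (Src, probability).
EdgeLists buildIncomingMatrix(const EdgeLists &Succs, std::size_t EntryIdx) {
  const std::size_t N = Succs.size();
  assert(EntryIdx < N && "entry block must be part of the graph");
  EdgeLists Incoming(N);
  for (std::size_t Src = 0; Src < N; ++Src) {
    if (Succs[Src].empty()) {
      // A sink: route all of its mass back to the entry block.
      Incoming[EntryIdx].emplace_back(Src, 1.0);
      continue;
    }
    double SumProb = 0.0;
    for (const Edge &Jump : Succs[Src])
      SumProb += Jump.second;
    assert(SumProb > 0.0 && "Zero sum probability of non-exit block");
    // Renormalize so that the outgoing probabilities of Src add up to one.
    for (const Edge &Jump : Succs[Src])
      Incoming[Jump.first].emplace_back(Src, Jump.second / SumProb);
  }
  return Incoming;
}
```

For instance, a block whose kept successors carry raw probabilities 0.6 and 0.2 (the remaining 0.2 having been dropped together with a filtered-out successor) ends up with transitions of about 0.75 and 0.25.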

		#ifndef NDEBUG
		template <class BT>
		BlockFrequencyInfoImplBase::Scaled64 BlockFrequencyInfoImpl<BT>::discrepancy(
		const ProbMatrixType &ProbMatrix, const std::vector<Scaled64> &Freq) const {
		assert(Freq[0] > 0 && "Incorrectly computed frequency of the entry block");
		Scaled64 Discrepancy;
		for (size_t I = 0; I < ProbMatrix.size(); I++) {
		Scaled64 Sum;
		for (const auto &Jump : ProbMatrix[I]) {
		Sum += Freq[Jump.first] * Jump.second;
		}
		Discrepancy += Freq[I] >= Sum ? Freq[I] - Sum : Sum - Freq[I];
		}
		// Normalizing by the frequency of the entry block
		return Discrepancy / Freq[0];
		}
		#endif

		/// \note This should be a lambda, but that crashes GCC 4.7.
		namespace bfi_detail {

		template <class BT> struct BlockEdgesAdder {
		using BlockT = BT;
		using LoopData = BlockFrequencyInfoImplBase::LoopData;
		using Successor = GraphTraits<const BlockT *>;

llvm/lib/Analysis/BlockFrequencyInfoImpl.cpp

				#define DEBUG_TYPE "block-freq"

				namespace llvm {
				cl::opt<bool> CheckBFIUnknownBlockQueries(
				"check-bfi-unknown-block-queries",
				cl::init(false), cl::Hidden,
				cl::desc("Check if block frequency is queried for an unknown block "
				"for debugging missed BFI updates"));

				cl::opt<bool> UseIterativeBFIInference(
				"use-iterative-bfi-inference", cl::init(false), cl::Hidden,
				cl::desc("Apply an iterative post-processing to infer correct BFI counts"));

				cl::opt<unsigned> IterativeBFIMaxIterationsPerBlock(
				"iterative-bfi-max-iterations-per-block", cl::init(1000), cl::Hidden,
				cl::desc("Iterative inference: maximum number of update iterations "
				"per block"));

				cl::opt<double> IterativeBFIPrecision(
				"iterative-bfi-precision", cl::init(1e-12), cl::Hidden,
				wenlei: Perhaps `iterative-bfi-precision` or something similar is more reflective of what it does? It'd be helpful to mention, somewhere in the comment or description, the trade-off between precision and run time (iterations needed to converge).
				cl::desc("Iterative inference: delta convergence precision; smaller values "
				"typically lead to better results at the cost of worsen runtime"));
				}

				ScaledNumber<uint64_t> BlockMass::toScaled() const {
				if (isFull())
				return ScaledNumber<uint64_t>(1, 0);
				return ScaledNumber<uint64_t>(getMass() + 1, -64);
				}


llvm/test/Transforms/SampleProfile/Inputs/profile-correlation-irreducible-loops.prof

This file was added.

				yyparse_1:10822:0
				1: 1
				2: 1003
				3: 1002
				4: 1002
				5: 1
				6: 0
				7: 1
				8: 0
				9: 1
				!CFGChecksum: 158496288380146391

				foo1:2297361:0
				1: 1
				2: 86
				3: 8212
				4: 1
				5: 17747
				!CFGChecksum: 404850113186107133

llvm/test/Transforms/SampleProfile/profile-correlation-irreducible-loops.ll

This file was added.

				; RUN: opt < %s -passes=sample-profile -sample-profile-file=%S/Inputs/profile-correlation-irreducible-loops.prof | opt -analyze -block-freq -enable-new-pm=0 -use-iterative-bfi-inference | FileCheck %s
				; RUN: opt < %s -passes=sample-profile -sample-profile-file=%S/Inputs/profile-correlation-irreducible-loops.prof -S | FileCheck %s --check-prefix=CHECK2
				davidxl: Why -enable-new-pm=0?
				spupyrev (author): Without the option, I get: "Cannot specify -analyze under new pass manager, either specify '-enable-new-pm=0', or use the corresponding new pass manager pass, e.g. '-passes=print<scalar-evolution>'. For a full list of passes, see the '--print-passes' flag."
				; RUN: opt < %s -analyze -block-freq -enable-new-pm=0 -use-iterative-bfi-inference | FileCheck %s --check-prefix=CHECK3
				hoy: The pseudo-probe pass is probably not needed since the test IR comes with pseudo probes.

				; The C++ code for this test case is from c-parse.c in 403.gcc (SPEC2006)
				; The problem with BFI for the test is solved by applying iterative inference.
				; The corresponding CFG graph is shown below, with intended counts for every
				; basic block. The hot loop, b3->b4->b2, is not getting proper (large) counts
				; unless the -use-iterative-bfi-inference option is specified.
				;
				; +-------------------------------------------+
				davidxl: It will be helpful to draw a simple text art CFG to demonstrate the expected bb counts.
				; |                                           |
				; |                   +----------+            |
				; |                   |  b1 [1]  |            |
				; |                   +----------+            |
				; |                     |                     |
				; |                     |                     |
				; |                     v                     |
				; |                   +----------+            |
				; |    +------------> | b2 [625] | -+         |
				; |    |              +----------+  |         |
				; |    |                |           |         |
				; |    |                |           |         |
				; |    |                v           |         |
				; |  +----------+     +----------+  |         |
				; |  | b4 [624] | <-- | b3 [625] | <+---------+
				; |  +----------+     +----------+  |
				; |                     |           |
				; +----+                |           |
				;      |                v           v
				;    +----------+     +--------------------+
				;    |  b8 [1]  | <-- |       b7 [2]       |
				;    +----------+     +--------------------+
				;                       |           ^
				;                       |           |
				;                       v           |
				;    +----------+     +----------+  |
				;    |  b9 [1]  | <-- |  b5 [2]  |  |
				;    +----------+     +----------+  |
				;                       |           |
				;                       |           |
				;                       v           |
				;                     +----------+  |
				;                     |  b6 [1]  | -+
				;                     +----------+

				@yydebug = dso_local global i32 0, align 4

				; Function Attrs: noinline nounwind uwtable
				define dso_local i32 @yyparse_1() #0 {
				b1:
				call void @llvm.pseudoprobe(i64 -7702751003264189226, i64 1, i32 0, i64 -1)
				%0 = load i32, i32* @yydebug, align 4
				%cmp = icmp ne i32 %0, 0
				br label %b2
				; CHECK: - b1: float = {{.*}}, int = {{.*}}, count = 1

				b2:
				call void @llvm.pseudoprobe(i64 -7702751003264189226, i64 2, i32 0, i64 -1)
				br i1 %cmp, label %b7, label %b3
				; CHECK: - b2: float = {{.*}}, int = {{.*}}, count = 625

				b3:
				call void @llvm.pseudoprobe(i64 -7702751003264189226, i64 3, i32 0, i64 -1)
				br i1 %cmp, label %b7, label %b4
				; CHECK: - b3: float = {{.*}}, int = {{.*}}, count = 625
				; CHECK2: br i1 %cmp, label %b7, label %b4,
				; CHECK2-SAME: !prof ![[END172_PROF:[0-9]+]]

				b4:
				call void @llvm.pseudoprobe(i64 -7702751003264189226, i64 4, i32 0, i64 -1)
				br label %b2
				; CHECK: - b4: float = {{.*}}, int = {{.*}}, count = 624

				b5:
				call void @llvm.pseudoprobe(i64 -7702751003264189226, i64 5, i32 0, i64 -1)
				br i1 %cmp, label %b9, label %b6
				; CHECK: - b5: float = {{.*}}, int = {{.*}}, count = 2

				b6:
				call void @llvm.pseudoprobe(i64 -7702751003264189226, i64 6, i32 0, i64 -1)
				br label %b7
				; CHECK: - b6: float = {{.*}}, int = {{.*}}, count = 1

				b7:
				call void @llvm.pseudoprobe(i64 -7702751003264189226, i64 7, i32 0, i64 -1)
				br i1 %cmp, label %b5, label %b8
				; CHECK: - b7: float = {{.*}}, int = {{.*}}, count = 2
				; CHECK2: br i1 %cmp, label %b5, label %b8,
				; CHECK2-SAME: !prof ![[FALSE4858_PROF:[0-9]+]]

				b8:
				call void @llvm.pseudoprobe(i64 -7702751003264189226, i64 8, i32 0, i64 -1)
				br label %b3
				; CHECK: - b8: float = {{.*}}, int = {{.*}}, count = 1

				b9:
				call void @llvm.pseudoprobe(i64 -7702751003264189226, i64 9, i32 0, i64 -1)
				%1 = load i32, i32* @yydebug, align 4
				ret i32 %1
				; CHECK: - b9: float = {{.*}}, int = {{.*}}, count = 1

				}

				; Another difficult (for BFI) instance with irreducible loops,
				; containing 'indirectbr'. The corresponding CFG graph is shown below, with
				; intended counts for every basic block.
				;
				;    +-----------+
				;    |  b1 [1]   |
				;    +-----------+
				;      |
				;      |
				;      v
				;    +------------------------+
				; +- |        b2 [86]         | <+
				; |  +------------------------+  |
				; |    |            |            |
				; |    |            |            |
				; |    v            |            |
				; |  +-----------+  |            |
				; |  | b3 [8212] | <+-------+    |
				; |  +-----------+  |       |    |
				; |    |            |       |    |
				; |    |            |       |    |
				; |    v            v       |    |
				; |  +------------------------+  |
				; |  |  indirectgoto [17747]  | -+
				; |  +------------------------+
				; |    |  ^  |
				; |    |  +--+
				; |    v
				; |  +-----------+
				; +> |  b4 [1]   |
				;    +-----------+

				; Function Attrs: nounwind uwtable
				define dso_local i32 @foo1() #0 !prof !132 {
				b1:
				call void @llvm.pseudoprobe(i64 7682762345278052905, i64 1, i32 0, i64 -1)
				%0 = load i32, i32* @yydebug, align 4
				%cmp = icmp ne i32 %0, 0
				br label %b2
				; CHECK3: - b1: float = {{.*}}, int = {{.*}}, count = 1

				b2:
				call void @llvm.pseudoprobe(i64 7682762345278052905, i64 2, i32 0, i64 -1)
				%1 = load i32, i32* @yydebug, align 4
				switch i32 %1, label %b4 [
				i32 1, label %indirectgoto
				i32 2, label %b3
				], !prof !133
				; CHECK3: - b2: float = {{.*}}, int = {{.*}}, count = 86

				b3:
				call void @llvm.pseudoprobe(i64 7682762345278052905, i64 3, i32 0, i64 -1)
				br label %indirectgoto
				; CHECK3: - b3: float = {{.*}}, int = {{.*}}, count = 8212

				b4:
				call void @llvm.pseudoprobe(i64 7682762345278052905, i64 4, i32 0, i64 -1)
				%2 = load i32, i32* @yydebug, align 4
				ret i32 %2
				; CHECK3: - b4: float = {{.*}}, int = {{.*}}, count = 1

				indirectgoto:
				%indirect.goto.dest = alloca i8, align 4
				call void @llvm.pseudoprobe(i64 7682762345278052905, i64 5, i32 0, i64 -1)
				indirectbr i8* %indirect.goto.dest, [label %b2, label %indirectgoto, label %b4, label %b3], !prof !134
				; CHECK3: - indirectgoto: float = {{.*}}, int = {{.*}}, count = 17747

				}

				declare void @llvm.pseudoprobe(i64, i64, i32, i64) #1

				attributes #0 = { noinline nounwind uwtable "use-sample-profile"}
				attributes #1 = { nounwind }

				!llvm.pseudo_probe_desc = !{!1079, !4496}
				!1079 = !{i64 -7702751003264189226, i64 158496288380146391, !"yyparse_1", null}
				!4496 = !{i64 7682762345278052905, i64 404850113186107133, !"foo1", null}
				!132 = !{!"function_entry_count", i64 1}
				!133 = !{!"branch_weights", i32 0, i32 86, i32 0}
				!134 = !{!"branch_weights", i32 85, i32 9449, i32 1, i32 8212}

				; CHECK2: ![[END172_PROF]] = !{!"branch_weights", i32 1, i32 1003}
				; CHECK2: ![[FALSE4858_PROF]] = !{!"branch_weights", i32 2, i32 1}


Diff 351630
