This is an archive of the discontinued LLVM Phabricator instance.

[JumpThreading] Only write back branch-weight MDs for blocks that originally had PGO info
Closed, Public

Authored by anemet on Aug 31 2016, 8:02 PM.

Details

Reviewers

chandlerc
davidxl
sanjoy
hfinkel
dexonsmith

Commits

rGc520822dbfc2: [JumpThreading] Only write back branch-weight MDs for blocks that originally…
rL280713: [JumpThreading] Only write back branch-weight MDs for blocks that originally…

Summary

Currently the pass updates branch weights in the IR if the function has
any PGO info (entry frequency is set). However we could still have
regions of the CFG that do not have branch weights collected (e.g. a
cold region). In this case we'd use static estimates. Since static
estimates for branches are determined independently, they are
inconsistent. Updating them can "randomly" inflate block frequencies.

I've run into this in a completely cold loop of h264ref from
SPEC. -Rpass-with-hotness showed the loop to be completely cold during
inlining (before JT) but completely hot during vectorization (after JT).

The new testcase demonstrates the problem. We check array elements
against 1, 2 and 3 in a loop. The check against 3 is the loop-exiting
check. The block names should be self-explanatory.

In this example, jump threading incorrectly updates the weight of the
loop-exiting branch to 0, drastically inflating the frequency of the
loop (in the range of billions).

There is no run-time profile info for edges inside the loop, so branch
probabilities are estimated. These are the resulting branch and block
frequencies for the loop body:

         check_1 (16)
     (8) /  |
     eq_1   | (8)
         \  |
         check_2 (16)
     (8) /  |
     eq_2   | (8)
         \  |
         check_3 (16)
     (1) /  |
(loop exit) | (15)
            |
       (back edge)

First we thread eq_1 -> check_2 to check_3. Frequencies are updated to
remove the frequency of eq_1 from check_2 and then from the false edge
leaving check_2. Changed frequencies are highlighted with * *:

         check_1 (16)
     (8) /  |
     eq_1   | (8)
    /       |
   /     check_2 (*8*)
  /  (8) /  |
  \  eq_2   | (*0*)
   \     \  |
    ` --- check_3 (16)
     (1) /  |
(loop exit) | (15)
            |
       (back edge)

Next we thread eq_1 -> check_3 and eq_2 -> check_3 to check_1 as new
back edges. Frequencies are updated to remove the frequency of eq_1 and
eq_2 from check_3 and then from the false edge leaving check_3 (changed
frequencies are highlighted with * *):

                check_1 (16)
            (8) /  |
            eq_1   | (8)
           /       |
          /     check_2 (*8*)
         /  (8) /  |
        /-- eq_2   | (*0*)
(back edge)        |
                check_3 (*0*)
          (*0*) /  |
       (loop exit) | (*0*)
                   |
              (back edge)

As a result, the loop exit edge ends up with 0 frequency, which in turn
gives the loop header the maximum frequency.
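
The arithmetic behind the walk-through can be sketched in a few lines of standalone C++. This is a simplified model under stated assumptions, not LLVM's BFI implementation: all function names are hypothetical, the loop is assumed to be entered once, and the header frequency is approximated as the entry frequency divided by the per-iteration exit probability, saturating when that probability is zero.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper (not an LLVM API): remove a threaded predecessor's
// frequency from a block or edge frequency, clamping at zero.
inline uint64_t subtractSaturating(uint64_t Freq, uint64_t Removed) {
  return Freq > Removed ? Freq - Removed : 0;
}

// Hypothetical simplified loop model: a loop entered EntryFreq times that
// leaves with probability ExitNum / ExitDen on each iteration runs its header
// about EntryFreq * ExitDen / ExitNum times. A zero exit probability
// saturates to the maximum representable frequency.
inline uint64_t estimateHeaderFreq(uint64_t EntryFreq, uint64_t ExitNum,
                                   uint64_t ExitDen) {
  assert(ExitDen != 0 && "probability needs a non-zero denominator");
  if (ExitNum == 0)
    return std::numeric_limits<uint64_t>::max();
  return EntryFreq * ExitDen / ExitNum;
}

// Before threading every iteration reaches check_3, which exits with the
// statically estimated probability 1/16.
inline uint64_t headerFreqBeforeThreading() {
  return estimateHeaderFreq(/*EntryFreq=*/1, /*ExitNum=*/1, /*ExitDen=*/16);
}

// After both threading steps the only path to the exit goes through the
// check_2 -> check_3 edge, whose frequency had eq_1's 8 removed from its 8.
inline uint64_t headerFreqAfterThreading() {
  const uint64_t Check2ToCheck3 = subtractSaturating(8, 8); // becomes 0
  // Per-iteration exit probability: (Check2ToCheck3 / 16) * (1 / 16).
  return estimateHeaderFreq(/*EntryFreq=*/1, /*ExitNum=*/Check2ToCheck3 * 1,
                            /*ExitDen=*/16 * 16);
}
```

In this model `headerFreqBeforeThreading()` matches the 16 shown for check_1 in the first diagram, while `headerFreqAfterThreading()` saturates, which is the "maximum frequency" effect described above.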

There are a few potential problems here:

1. The profile data seems odd. There is a single profile sample of the
   loop being entered. On the other hand, there are no weights inside the
   loop.

2. Based on static estimation we shouldn't set edges to "extreme"
   values, i.e. extremely likely or unlikely.

3. We shouldn't create profile metadata that is calculated from static
   estimation. I am not sure what the policy is, but it seems to make sense to
   treat profile metadata as something that is known to originate from
   profiling. Estimated probabilities should only be reflected in BPI/BFI.

Any one of these would probably fix the immediate problem. I went for 3
because I think it's a good policy to have and added a FIXME about 2.

Diff Detail

Event Timeline

anemet updated this revision to Diff 69943. Aug 31 2016, 8:02 PM

anemet retitled this revision from to [JumpThreading] Only write back branch-weight MDs for blocks that originally had PGO info.

anemet updated this object.

anemet added reviewers: davidxl, hfinkel, dexonsmith, sanjoy, chandlerc.

anemet added a subscriber: llvm-commits.

anemet mentioned this in D24144: [CFGPrinter] Display branch weight on the edges. Sep 1 2016, 10:22 AM

Can you show me an example of

"The profile data seems odd. There is a single profile sample of the
loop being entered. On the other hand, there are no weights inside the
loop."

I want to know what is going on. Are you using FE PGO or IR PGO?

In D24118#531735, @davidxl wrote:

> Can you show me an example of
>
> "The profile data seems odd. There is a single profile sample of the
> loop being entered. On the other hand, there are no weights inside the
> loop."
>
> I want to know what is going on. Are you using FE PGO or IR PGO?

It's FE-based.

The CFG of the function with the branch weights from the IR looks like this: https://reviews.llvm.org/F2391730. I believe the IR was produced with -disable-llvm-optzns.

I can also send you the PGO data file privately if you're interested.

There is a fundamental problem in BFI: it cannot handle a weight of 0. To work around it:

1. The FE PGO annotator always adds 1 to the weights of both targets unconditionally when annotating a branch.
2. BFI always adds 1 to the weight if it is zero.

The end result is that:

1. We will never see a code region annotated with zero frequency/count.
2. For FE PGO, every loop trip count appears to be half of the real trip count.

I will try to improve this situation independently.
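
One way to see how adding 1 to both weights can halve an apparent trip count is the small sketch below. It rests on assumptions that the comment does not state: the loop is entered exactly once (so its exit edge has a real count of 1), and the trip count is estimated from the latch branch as (back-edge weight + exit weight) / exit weight. The function names are hypothetical and are not LLVM APIs.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical estimate: average number of iterations per loop entry, derived
// from the two weights on the loop's latch branch.
inline double estimatedTripCount(uint64_t BackEdgeWeight, uint64_t ExitWeight) {
  assert(ExitWeight != 0 && "exit weight must be non-zero");
  return static_cast<double>(BackEdgeWeight + ExitWeight) /
         static_cast<double>(ExitWeight);
}

// Models the behaviour described above: 1 is added to both branch weights
// before they are used.
inline double tripCountAfterAddOne(uint64_t BackEdgeCount, uint64_t ExitCount) {
  return estimatedTripCount(BackEdgeCount + 1, ExitCount + 1);
}
```

For a loop that runs 100 iterations on a single entry (back edge taken 99 times, exit taken once), `estimatedTripCount(99, 1)` gives 100 while `tripCountAfterAddOne(99, 1)` gives 51, roughly half. With larger exit counts the distortion in this model is smaller, so the sketch only covers the single-entry case.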

davidxl added inline comments. Sep 1 2016, 3:41 PM

lib/Transforms/Scalar/JumpThreading.cpp
1633	Check the 'branch_weight' string.

anemet mentioned this in rL280442: [CFGPrinter] Display branch weight on the edges. Sep 1 2016, 5:36 PM

In D24118#532156, @davidxl wrote:

> There is a fundamental problem in BFI: it cannot handle a weight of 0. To work around it:
>
> 1. The FE PGO annotator always adds 1 to the weights of both targets unconditionally when annotating a branch.
> 2. BFI always adds 1 to the weight if it is zero.
>
> The end result is that:
>
> 1. We will never see a code region annotated with zero frequency/count.

Ah, that totally explains what's happening here. Is there a PR?

> 2. For FE PGO, every loop trip count appears to be half of the real trip count.

Wow, this was going to be the next thing for me to investigate, thanks for the insights!

There is no PR for the first problem.

For the loop trip count problem, see https://llvm.org/bugs/show_bug.cgi?id=27791. I have cc'ed you in that bug.

The right solution for problem #1 is:

1. In BPI, when reading metadata to create a BP, if there is a zero weight, BPI should assign an extreme branch probability value instead of using zero.
2. In BFI, remove the code that adds 1 to a zero weight.
3. In FE PGO, do not add one to the weight.

The handling in 2) is not ideal for one very special case: a loop whose exit edge is never executed (for instance because of a call to a noreturn function). In such a case, BFI will still produce a huge frequency for the loop body. A real solution may require PGOUse to annotate the loop body with the real count of the loop header.

Addressed David's comment.

lgtm with some test change suggestion.

test/Transforms/JumpThreading/static-profile.ll
79	Perhaps check that threading happens here and that this branch has no metadata:
	CHECK: br .... check2_thread, .... label %check_2$
	CHECK-NEXT: check_2.thread:
87	Similarly as above: check that threading happens and that no metadata is annotated here.

This revision is now accepted and ready to land. Sep 3 2016, 12:25 PM

anemet added inline comments. Sep 6 2016, 9:09 AM

test/Transforms/JumpThreading/static-profile.ll
79	Right! Thanks for the review!

Closed by commit rL280713: [JumpThreading] Only write back branch-weight MDs for blocks that originally… (authored by anemet). Sep 6 2016, 9:17 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path                                               Size
include/llvm/Transforms/Scalar/JumpThreading.h     2 lines
lib/Transforms/Scalar/JumpThreading.cpp            53 lines
test/Transforms/JumpThreading/static-profile.ll    106 lines

Diff 70206

include/llvm/Transforms/Scalar/JumpThreading.h

@@ 128 lines not shown @@ public:
   bool TryToUnfoldSelect(CmpInst *CondCmp, BasicBlock *BB);
   bool TryToUnfoldSelectInCurrBB(BasicBlock *BB);

 private:
   BasicBlock *SplitBlockPreds(BasicBlock *BB, ArrayRef<BasicBlock *> Preds,
                               const char *Suffix);
   void UpdateBlockFreqAndEdgeWeight(BasicBlock *PredBB, BasicBlock *BB,
                                     BasicBlock *NewBB, BasicBlock *SuccBB);
+  /// Check if the block has profile metadata for its outgoing edges.
+  bool doesBlockHaveProfileData(BasicBlock *BB);
 };

 } // end namespace llvm

 #endif

lib/Transforms/Scalar/JumpThreading.cpp

@@ 1,614 lines not shown @@ BasicBlock *JumpThreadingPass::SplitBlockPreds(BasicBlock *BB,

   // Set the block frequency of the newly created PredBB, which is the sum of
   // frequencies of Preds.
   if (HasProfileData)
     BFI->setBlockFreq(PredBB, PredBBFreq.getFrequency());
   return PredBB;
 }

+bool JumpThreadingPass::doesBlockHaveProfileData(BasicBlock *BB) {
+  const TerminatorInst *TI = BB->getTerminator();
+  assert(TI->getNumSuccessors() > 1 && "not a split");
+
+  MDNode *WeightsNode = TI->getMetadata(LLVMContext::MD_prof);
+  if (!WeightsNode)
+    return false;
+
+  MDString *MDName = cast<MDString>(WeightsNode->getOperand(0));
+  if (MDName->getString() != "branch_weights")
+    return false;
   [inline] davidxl: Check the 'branch_weight' string.
+
+  // Ensure there are weights for all of the successors. Note that the first
+  // operand to the metadata node is a name, not a weight.
+  return WeightsNode->getNumOperands() == TI->getNumSuccessors() + 1;
+}
+
 /// Update the block frequency of BB and branch weight and the metadata on the
 /// edge BB->SuccBB. This is done by scaling the weight of BB->SuccBB by 1 -
 /// Freq(PredBB->BB) / Freq(BB->SuccBB).
 void JumpThreadingPass::UpdateBlockFreqAndEdgeWeight(BasicBlock *PredBB,
                                                      BasicBlock *BB,
                                                      BasicBlock *NewBB,
                                                      BasicBlock *SuccBB) {
   if (!HasProfileData)
@@ 34 lines not shown @@ else {
     BranchProbability::normalizeProbabilities(BBSuccProbs.begin(),
                                               BBSuccProbs.end());
   }

   // Update edge probabilities in BPI.
   for (int I = 0, E = BBSuccProbs.size(); I < E; I++)
     BPI->setEdgeProbability(BB, I, BBSuccProbs[I]);

-  if (BBSuccProbs.size() >= 2) {
+  // Update the profile metadata as well.
+  //
+  // Don't do this if the profile of the transformed blocks was statically
+  // estimated. (This could occur despite the function having an entry
+  // frequency in completely cold parts of the CFG.)
+  //
+  // In this case we don't want to suggest to subsequent passes that the
+  // calculated weights are fully consistent. Consider this graph:
+  //
+  //                  check_1
+  //              50% /  |
+  //             eq_1    | 50%
+  //                 \   |
+  //                  check_2
+  //              50% /  |
+  //             eq_2    | 50%
+  //                 \   |
+  //                  check_3
+  //              50% /  |
+  //             eq_3    | 50%
+  //                 \   |
+  //
+  // Assuming the blocks check_* all compare the same value against 1, 2 and 3,
+  // the overall probabilities are inconsistent; the total probability that the
+  // value is either 1, 2 or 3 is 150%.
+  //
+  // As a consequence if we thread eq_1 -> check_2 to check_3, check_2->check_3
+  // becomes 0%. This is even worse if the edge whose probability becomes 0% is
+  // the loop exit edge. Then based solely on static estimation we would assume
+  // the loop was extremely hot.
+  //
+  // FIXME this locally as well so that BPI and BFI are consistent as well. We
+  // shouldn't make edges extremely likely or unlikely based solely on static
+  // estimation.
+  if (BBSuccProbs.size() >= 2 && doesBlockHaveProfileData(BB)) {
     SmallVector<uint32_t, 4> Weights;
     for (auto Prob : BBSuccProbs)
       Weights.push_back(Prob.getNumerator());

     auto TI = BB->getTerminator();
     TI->setMetadata(
         LLVMContext::MD_prof,
         MDBuilder(TI->getParent()->getContext()).createBranchWeights(Weights));
@@ 287 lines not shown @@

test/Transforms/JumpThreading/static-profile.ll

This file was added.

+; RUN: opt -S -jump-threading < %s | FileCheck %s
+
+; Check that based solely on static profile estimation we don't update the
+; branch-weight metadata. Even if the function has an entry frequency, a
+; completely cold part of the CFG may be statically estimated.
+
+; For example in the loop below, jump threading would update the weight of the
+; loop-exiting branch to 0, drastically inflating the frequency of the loop
+; (in the range of billions).
+;
+; This is the CFG of the loop. There is no run-time profile info for edges
+; inside the loop, so branch and block frequencies are estimated as shown:
+;
+;          check_1 (16)
+;      (8) /  |
+;      eq_1   | (8)
+;          \  |
+;          check_2 (16)
+;      (8) /  |
+;      eq_2   | (8)
+;          \  |
+;          check_3 (16)
+;      (1) /  |
+; (loop exit) | (15)
+;             |
+;        (back edge)
+;
+; First we thread eq_1->check_2 to check_3. Frequencies are updated to remove
+; the frequency of eq_1 from check_2 and then the false edge leaving check_2
+; (changed frequencies are highlighted with * *):
+;
+;          check_1 (16)
+;      (8) /  |
+;      eq_1   | (8)
+;     /       |
+;    /     check_2 (*8*)
+;   /  (8) /  |
+;   \  eq_2   | (*0*)
+;    \     \  |
+;     ` --- check_3 (16)
+;      (1) /  |
+; (loop exit) | (15)
+;             |
+;        (back edge)
+;
+; Next we thread eq_1->check_3 and eq_2->check_3 to check_1 as new back edges.
+; Frequencies are updated to remove the frequency of eq_1 and eq_3 from
+; check_3 and then the false edge leaving check_3 (changed frequencies are
+; highlighted with * *):
+;
+;                 check_1 (16)
+;             (8) /  |
+;             eq_1   | (8)
+;            /       |
+;           /     check_2 (*8*)
+;          /  (8) /  |
+;         /-- eq_2   | (*0*)
+; (back edge)        |
+;                 check_3 (*0*)
+;           (*0*) /  |
+;        (loop exit) | (*0*)
+;                    |
+;               (back edge)
+;
+; As a result, the loop exit edge ends up with 0 frequency which in turn makes
+; the loop header to have maximum frequency.
+
+declare void @bar()
+
+define void @foo(i32 *%p, i32 %n) !prof !0 {
+entry:
+  %enter_loop = icmp eq i32 %n, 0
+  br i1 %enter_loop, label %exit, label %check_1, !prof !1
+; CHECK: br i1 %enter_loop, label %exit, label %check_1, !prof !1
+
+check_1:
+  %v = load i32, i32* %p
+  %cond1 = icmp eq i32 %v, 1
+  br i1 %cond1, label %eq_1, label %check_2
   [inline] davidxl: Perhaps check threading happens here and this branch has no meta data
            CHECK: br .... check2_thread, .... label %check_2$
            CHECK-NEXT: check_2.thread:
   [inline] anemet: Right! Thanks for the review!
+
+eq_1:
+  call void @bar()
+  br label %check_2
+
+check_2:
+  %cond2 = icmp eq i32 %v, 2
+  br i1 %cond2, label %eq_2, label %check_3
   [inline] davidxl: similarly as above : check threading happens and no meta data is annotated here.
+
+eq_2:
+  call void @bar()
+  br label %check_3
+
+check_3:
+  %condE = icmp eq i32 %v, 3
+  br i1 %condE, label %exit, label %check_1
+; CHECK: br i1 %condE, label %exit, label %check_1{{$}}
+
+exit:
+  ret void
+}
+
+!0 = !{!"function_entry_count", i64 120}
+; CHECK-NOT: branch_weights
+!1 = !{!"branch_weights", i32 119, i32 1}
+; CHECK: !1 = !{!"branch_weights", i32 119, i32 1}
+; CHECK-NOT: branch_weights
