Download Raw Diff

Details

Reviewers

chandlerc
djasper
davidxl
dexonsmith

Commits

rG41cf1a5dfb49: Improving edge probabilities computation when choosing the best successor in…
rL253414: Improving edge probabilities computation when choosing the best successor in…

Summary

When looking for the best successor from the outer loop for a block belonging to an inner loop, the edge probability computation can be improved so that edges in the inner loop are ignored. For example, suppose we are building chains for the non-loop part of the following code, and looking for B1's best successor. Assume the true body is very hot, then B3 should be the best candidate. However, because of the existence of the back edge from B1 to B0, the probability from B1 to B3 can be very small, preventing B3 to be its successor. In this patch, when computing the probability of the edge from B1 to B3, the weight on the back edge B1->B0 is ignored, so that B1->B3 will have 100% probability.

if (...)

do {
  B0;
  ... // some branches
  B1;
} while(...);

else

B2;

B3;

Diff Detail

Repository: rL LLVM

Event Timeline

congh updated this revision to Diff 28732.Jun 29 2015, 4:53 PM

congh retitled this revision from to Improvement on computing edge probabilities when choosing the best successor in machine block placement. .

congh updated this object.

congh edited the test plan for this revision. (Show Details)

congh added a reviewer: chandlerc.

congh added a subscriber: Unknown Object (MLST).

congh added a reviewer: davidxl.Jun 29 2015, 5:35 PM

Looks reasonable -- as the already laid-out inner loop should be treated as a 'collapsed' node.

lib/CodeGen/MachineBlockPlacement.cpp
392 ↗	(On Diff #28732)	I think it is better to provide a wrapper method to compute getSumForBlock (BB). In the wrapper method, exclude edge whose destination is in the same loop chain as BB.

Update the patch to add context lines.

Ping?

Fixed a bug in the patch.

davidxl added inline comments.Jul 23 2015, 12:30 AM

lib/CodeGen/MachineBlockPlacement.cpp
367 ↗	(On Diff #30426)	Do SumWeight adjustment here -- outside the main loop: uint32_t SumWeight = ... SumWeight = AdjustWeightSumExcludingInternalBackEdge(SumWeight, BB);
386 ↗	(On Diff #30426)	Do you have examples of the problem when chain formation is for a 'peer' loop?
393 ↗	(On Diff #30426)	need to check SuccLoop !=0 ?
393 ↗	(On Diff #30426)	It is more readable to move the Sumweight adjustment loop outside the main for loop and probably in a helper function.
397 ↗	(On Diff #30426)	No need to change SuccProb to SuccProbInLoop.
402 ↗	(On Diff #30426)	No need for var name change.
434 ↗	(On Diff #30426)	Change SuccProb to RealSuccProb

congh marked 3 inline comments as done.Jul 23 2015, 11:03 AM

congh added inline comments.

lib/CodeGen/MachineBlockPlacement.cpp
367 ↗	(On Diff #30426)	The adjustment on SumWeight is done only when it has a successor outside of its loop that is also in BlockFilter (if the successor outside of the loop is not in BlockFilter, we are building chains for the loop). It makes more sense to calculate this adjusted weight inside of this loop. Note that when we build chains for the loop, we don't need this adjusted weight at all.
386 ↗	(On Diff #30426)	The following partial CFG shows such a case: A ⇵ B ↓ C ⇵ D Here, B->C is an edge between two peer loops.
393 ↗	(On Diff #30426)	SuccLoop can be 0 when the successor is not in any loop.

davidxl added inline comments.Jul 23 2015, 12:03 PM

lib/CodeGen/MachineBlockPlacement.cpp

367 ↗

(On Diff #30426)

I think the adjustment can be done without even needing to have a top level for loop like

for (auto Succ: BB->successors()) {

if (...) continue;
for (auto SuccBB : BB->successors())

if (MLI->getLoopFor(SuccBB) == L)

		          SumWeight2 -= MBPI->getEdgeWeight(BB, SuccBB) / WeightScale;
    ..

}

but simply:

for (auto SuccBB : BB->successors())

        // We find out that SuccBB is already in 'Chain', but it is not the end of 'Chain' -- it must be a block in the inner 
        // loop including BB that has already be laid out. Skip it 
        if (BlockToChain(SuccBB) == &Chain)   
		   SumWeight -= MBPI->getEdgeWeight(BB, SuccBB) / WeightScale;

386 ↗

(On Diff #30426)

Ok. A test case would be good.

congh added inline comments.Jul 23 2015, 1:45 PM

lib/CodeGen/MachineBlockPlacement.cpp
367 ↗	(On Diff #30426)	But if the first successor in BB->successors() is not in the same loop as BB, we won't get the correct SumWeight in this way. As most CFG nodes have at most two outgoing edges, we won't get many repeated computations.
386 ↗	(On Diff #30426)	I think my comment here is incorrect. In the case above, we are not building chains for the peer loop but the outer loop, while the successor belongs to a peer loop. I will correct the comment.
386 ↗	(On Diff #30426)	OK. Will add such a case.

davidxl added inline comments.Jul 23 2015, 2:25 PM

lib/CodeGen/MachineBlockPlacement.cpp
367 ↗	(On Diff #30426)	If the successor of BB is not in the same loop as BB but matches the condition I listed, there are two cases: the succBB is in an inner loop of BB's loop, in this case we want to skip this SuccBB the succBB is in the current loop being laid out. If succBB satisfies the condition, this block must be a hotter block that is laid out before 'BB' with top-order broken. In this case, edge weight (BB->SuccBB) should also be skipped -- this is not any different from the backedge case.

congh added inline comments.Jul 23 2015, 2:47 PM

lib/CodeGen/MachineBlockPlacement.cpp
367 ↗	(On Diff #30426)	Now I understand your suggested simple loop: you are putting that simple loop before the outer for loop instead of inserting the if statement inside of it. Considering the case 2 above, I think now it makes more sense to hoist this sum weight update outside of the loop. I will update the patch accordingly. Thanks for the advice!

Update the patch.

When looking for the best successor of a block, now we only consider its valid successors that could become its successor in the chain, and calculate the sum of weights from those valid successors which is later used to calculate edge probabilities. This can handle more general cases than the original patch.

davidxl added inline comments.Jul 23 2015, 4:26 PM

lib/CodeGen/MachineBlockPlacement.cpp
373 ↗	(On Diff #30533)	This is the case where Succ is in the outer loop of the current loop being laid out. In such a case, Succ can still be a viable layout successor of BB if BB ends up being the end of the current chain. This is tricky to handle as loop rotation may also change that -- so it seems to good to add this skipping condition, but I'd like to see test cases that can validate it.
381 ↗	(On Diff #30533)	This is a good one to add.

congh added inline comments.Jul 23 2015, 5:08 PM

lib/CodeGen/MachineBlockPlacement.cpp
373 ↗	(On Diff #30533)	Given the following example: --->A \| / \ \| B C \| \ / \ ----D E Assume the weights of A->B and A->C are 1 and 99, and the weights of C->D and C->E are 50 and 50. Without this patch, after A->C is chosen, C->D could not be chosen due to CFG constraint and the probability of it is only 50%. So the generated layout for the loop is A C B D. With this patch, the layout will be A C D B, which is better than before. (Note that with loop rotation the latter may suffer some issue but it isn't the problem of this patch; I will deal with it alter). I will add such a test case.

Add one more test case showing that we ignore the edge outside of the loop during weights summation when we build the chain for the loop.

chandlerc added inline comments.Jul 24 2015, 7:59 PM

lib/CodeGen/MachineBlockPlacement.cpp
365–371 ↗	(On Diff #30549)	This is made incredibly annoying because we don't rescale the branch probabilities in MBPI the way we do in BPI, and so we don't know for a fact that they fit in 32-bits. I think we need to fix MBPI to automatically scale all the weights so that the sum of weights is known to fit in 32-bits. Once that's done, WeightScale and all the code in getSumForBlock to rescale things goes away. Then, here, you should change this to just compute the sum directly rather than getting the sum from MBPI and then adjusting it after-the-fact. Also, I would just embed the predicate into a lambda, and write two loops over BB->successors() rather than building up a vector of successors.
381–383 ↗	(On Diff #30549)	This is a bigger change than the rest, and I think it should go in separately. The rest of this change is a clear bug-fix for how this code works, and will make the "hot" detection actually reasonable below. This change, however, is shifting a core thing to prefer to outline all optional loop bodies which don't have fallthrough into them. That, to me, is a very significant change, and I'd rather look at it (and test cases and benchmarks showing it is clearly the right call) independently of fixing how we compute the probability of 1 of N successors.

davidxl added inline comments.Jul 24 2015, 9:19 PM

lib/CodeGen/MachineBlockPlacement.cpp
381–383 ↗	(On Diff #30549)	The core idea of Cong's patch is to make trace formation more correctly. The basic principle of the original trace formation (probability based) is that when a BB has multiple viable successors, but none of them is more dominating, then do not use outgoing succ edge frequency to determine layout successor The bug in the current implementation is that when checking dominating outgoing edges, it also look at non-viable successors that have already being laid out. The fix here is simply to eliminate the non-candidate when selecting trace successor. Cong, as I mentioned before, it is much clearer to extract the adjustment code into a helper routine with documentation to make the intention clear.

congh added inline comments.Jul 27 2015, 12:36 AM

lib/CodeGen/MachineBlockPlacement.cpp
365–371 ↗	(On Diff #30549)	OK. Let me try to produce a patch doing edge weight normalization for MBPI.
381–383 ↗	(On Diff #30549)	I have some cases that this change is beneficial. But I agree we need to reevaluate its impact on loops without fall-through to header (from the aspect of icache locality). I can disable this "optimization" attempt from this patch and do more research on it.

congh added inline comments.Jul 28 2015, 3:55 PM

lib/CodeGen/MachineBlockPlacement.cpp
365–371 ↗	(On Diff #30549)	I found it not straightforward to normalize weights in MBPI. Note that MBPI reads weights from MBB, which provides the interface addSuccessor(MachineBasicBlock *succ, uint32_t weight) to let users indicate arbitrary weights for new added edges. Because we don't know how users use this interface, it can be difficult to adjust weights in MBB. I think there may be two solutions: We provide the interface normalizeWeights in MBB. Once we call this interface, the sum of weights is guaranteed to be less than UINT32_MAX. However, after the user add new successors with weights, we need to re-normalize it. This can still be useful when we want to calculate branch probabilities or do other adjustment after the weights are normalized. We maintained both normalized and original weights in MBB, so that once the original weights is modified, we also adjust the normalized weights. This solutions requires more space and more frequent calculations comparing to the first one, but the normalized weights are always guaranteed. Do you have any other ideas? Or which one do you think is better?

congh added a reviewer: djasper.Sep 10 2015, 1:17 PM

Ping?

Will get back to it .

Looks fine to me with some nits. It might be better to hold off this until the BP interface change is done?

lib/CodeGen/MachineBlockPlacement.cpp
373 ↗	(On Diff #30549)	Can you add your example as a comment in the code?

Update the patch according to David's comment.

In D10825#291191, @davidxl wrote:

Looks fine to me with some nits. It might be better to hold off this until the BP interface change is done?

That patch will update the code in this file and I think it is fine to check in it now.

Closed by commit rL253414: Improving edge probabilities computation when choosing the best successor in… (authored by conghou). · Explain WhyNov 17 2015, 4:55 PM

This revision was automatically updated to reflect the committed changes.

Diff 40459

llvm/trunk/lib/CodeGen/MachineBlockPlacement.cpp

Show First 20 Lines • Show All 383 Lines • ▼ Show 20 Lines	MachineBlockPlacement::selectBestSuccessor(MachineBasicBlock *BB,
// the MBPI analysis, we manually compute probabilities using the edge		// the MBPI analysis, we manually compute probabilities using the edge
// weights. This is suboptimal as it means that the somewhat subtle		// weights. This is suboptimal as it means that the somewhat subtle
// definition of edge weight semantics is encoded here as well. We should		// definition of edge weight semantics is encoded here as well. We should
// improve the MBPI interface to efficiently support query patterns such as		// improve the MBPI interface to efficiently support query patterns such as
// this.		// this.
uint32_t BestWeight = 0;		uint32_t BestWeight = 0;
uint32_t WeightScale = 0;		uint32_t WeightScale = 0;
uint32_t SumWeight = MBPI->getSumForBlock(BB, WeightScale);		uint32_t SumWeight = MBPI->getSumForBlock(BB, WeightScale);
DEBUG(dbgs() << "Attempting merge from: " << getBlockName(BB) << "\n");
		// Adjust sum of weights by excluding weights on edges pointing to blocks that
		// is either not in BlockFilter or is already in the current chain. Consider
		// the following CFG:
		//
		// --->A
		// \| / \
		// \| B C
		// \| \ / \
		// ----D E
		//
		// Assume A->C is very hot (>90%), and C->D has a 50% probability, then after
		// A->C is chosen as a fall-through, D won't be selected as a successor of C
		// due to CFG constraint (the probability of C->D is not greater than
		// HotProb). If we exclude E that is not in BlockFilter when calculating the
		// probability of C->D, D will be selected and we will get A C D B as the
		// layout of this loop.
		uint32_t AdjustedSumWeight = SumWeight;
		SmallVector<MachineBasicBlock *, 4> Successors;
for (MachineBasicBlock *Succ : BB->successors()) {		for (MachineBasicBlock *Succ : BB->successors()) {
if (BlockFilter && !BlockFilter->count(Succ))		bool SkipSucc = false;
continue;		if (BlockFilter && !BlockFilter->count(Succ)) {
BlockChain &SuccChain = *BlockToChain[Succ];		SkipSucc = true;
if (&SuccChain == &Chain) {		} else {
DEBUG(dbgs() << " " << getBlockName(Succ) << " -> Already merged!\n");		BlockChain *SuccChain = BlockToChain[Succ];
continue;		if (SuccChain == &Chain) {
}		DEBUG(dbgs() << " " << getBlockName(Succ)
if (Succ != *SuccChain.begin()) {		<< " -> Already merged!\n");
		SkipSucc = true;
		} else if (Succ != *SuccChain->begin()) {
DEBUG(dbgs() << " " << getBlockName(Succ) << " -> Mid chain!\n");		DEBUG(dbgs() << " " << getBlockName(Succ) << " -> Mid chain!\n");
continue;		continue;
}		}
		}
		if (SkipSucc)
		AdjustedSumWeight -= MBPI->getEdgeWeight(BB, Succ) / WeightScale;
		else
		Successors.push_back(Succ);
		}

		DEBUG(dbgs() << "Attempting merge from: " << getBlockName(BB) << "\n");
		for (MachineBasicBlock *Succ : Successors) {
uint32_t SuccWeight = MBPI->getEdgeWeight(BB, Succ);		uint32_t SuccWeight = MBPI->getEdgeWeight(BB, Succ);
BranchProbability SuccProb(SuccWeight / WeightScale, SumWeight);		BranchProbability SuccProb(SuccWeight / WeightScale, AdjustedSumWeight);

// If we outline optional branches, look whether Succ is unavoidable, i.e.		// If we outline optional branches, look whether Succ is unavoidable, i.e.
// dominates all terminators of the MachineFunction. If it does, other		// dominates all terminators of the MachineFunction. If it does, other
// successors must be optional. Don't do this for cold branches.		// successors must be optional. Don't do this for cold branches.
if (OutlineOptionalBranches && SuccProb > HotProb.getCompl() &&		if (OutlineOptionalBranches && SuccProb > HotProb.getCompl() &&
UnavoidableBlocks.count(Succ) > 0) {		UnavoidableBlocks.count(Succ) > 0) {
auto HasShortOptionalBranch = [&]() {		auto HasShortOptionalBranch = [&]() {
for (MachineBasicBlock *Pred : Succ->predecessors()) {		for (MachineBasicBlock *Pred : Succ->predecessors()) {
Show All 11 Lines	if (OutlineOptionalBranches && SuccProb > HotProb.getCompl() &&
return false;		return false;
};		};
if (!HasShortOptionalBranch())		if (!HasShortOptionalBranch())
return Succ;		return Succ;
}		}

// Only consider successors which are either "hot", or wouldn't violate		// Only consider successors which are either "hot", or wouldn't violate
// any CFG constraints.		// any CFG constraints.
		BlockChain &SuccChain = *BlockToChain[Succ];
if (SuccChain.LoopPredecessors != 0) {		if (SuccChain.LoopPredecessors != 0) {
if (SuccProb < HotProb) {		if (SuccProb < HotProb) {
DEBUG(dbgs() << " " << getBlockName(Succ) << " -> " << SuccProb		DEBUG(dbgs() << " " << getBlockName(Succ) << " -> " << SuccProb
<< " (prob) (CFG conflict)\n");		<< " (prob) (CFG conflict)\n");
continue;		continue;
}		}

// Make sure that a hot successor doesn't have a globally more		// Make sure that a hot successor doesn't have a globally more
// important predecessor.		// important predecessor.
		BranchProbability RealSuccProb(SuccWeight / WeightScale, SumWeight);
BlockFrequency CandidateEdgeFreq =		BlockFrequency CandidateEdgeFreq =
MBFI->getBlockFreq(BB) * SuccProb * HotProb.getCompl();		MBFI->getBlockFreq(BB) * RealSuccProb * HotProb.getCompl();
bool BadCFGConflict = false;		bool BadCFGConflict = false;
for (MachineBasicBlock *Pred : Succ->predecessors()) {		for (MachineBasicBlock *Pred : Succ->predecessors()) {
if (Pred == Succ \|\| (BlockFilter && !BlockFilter->count(Pred)) \|\|		if (Pred == Succ \|\| (BlockFilter && !BlockFilter->count(Pred)) \|\|
BlockToChain[Pred] == &Chain)		BlockToChain[Pred] == &Chain)
continue;		continue;
BlockFrequency PredEdgeFreq =		BlockFrequency PredEdgeFreq =
MBFI->getBlockFreq(Pred) * MBPI->getEdgeProbability(Pred, Succ);		MBFI->getBlockFreq(Pred) * MBPI->getEdgeProbability(Pred, Succ);
if (PredEdgeFreq >= CandidateEdgeFreq) {		if (PredEdgeFreq >= CandidateEdgeFreq) {
▲ Show 20 Lines • Show All 999 Lines • Show Last 20 Lines

llvm/trunk/test/CodeGen/X86/code_placement_ignore_succ_in_inner_loop.ll

				; RUN: llc -mcpu=corei7 -mtriple=x86_64-linux < %s \| FileCheck %s -check-prefix=CHECK

				define void @foo() {
				; Test that when determining the edge probability from a node in an inner loop
				; to a node in an outer loop, the weights on edges in the inner loop should be
				; ignored if we are building the chain for the outer loop.
				;
				; CHECK-LABEL: foo:
				; CHECK: callq c
				; CHECK: callq b

				entry:
				%call = call zeroext i1 @a()
				br i1 %call, label %if.then, label %if.else, !prof !1

				if.then:
				%call1 = call zeroext i1 @a()
				br i1 %call1, label %while.body, label %if.end.1, !prof !1

				while.body:
				%call2 = call zeroext i1 @a()
				br i1 %call2, label %if.then.1, label %while.cond

				if.then.1:
				call void @d()
				br label %while.cond

				while.cond:
				%call3 = call zeroext i1 @a()
				br i1 %call3, label %while.body, label %if.end

				if.end.1:
				call void @d()
				br label %if.end

				if.else:
				call void @b()
				br label %if.end

				if.end:
				call void @c()
				ret void
				}

				define void @bar() {
				; Test that when determining the edge probability from a node in a loop to a
				; node in its peer loop, the weights on edges in the first loop should be
				; ignored.
				;
				; CHECK-LABEL: bar:
				; CHECK: callq c
				; CHECK: callq b

				entry:
				%call = call zeroext i1 @a()
				br i1 %call, label %if.then, label %if.else, !prof !1

				if.then:
				%call1 = call zeroext i1 @a()
				br i1 %call1, label %if.then, label %while.body, !prof !2

				while.body:
				%call2 = call zeroext i1 @a()
				br i1 %call2, label %while.body, label %if.end, !prof !2

				if.else:
				call void @b()
				br label %if.end

				if.end:
				call void @c()
				ret void
				}

				define void @par() {
				; Test that when determining the edge probability from a node in a loop to a
				; node in its outer loop, the weights on edges in the outer loop should be
				; ignored if we are building the chain for the inner loop.
				;
				; CHECK-LABEL: par:
				; CHECK: callq c
				; CHECK: callq d
				; CHECK: callq b

				entry:
				br label %if.cond

				if.cond:
				%call = call zeroext i1 @a()
				br i1 %call, label %if.then, label %if.else, !prof !3

				if.then:
				call void @b()
				br label %if.end

				if.else:
				call void @c()
				%call1 = call zeroext i1 @a()
				br i1 %call1, label %if.end, label %exit, !prof !4

				if.end:
				call void @d()
				%call2 = call zeroext i1 @a()
				br i1 %call2, label %if.cond, label %if.end.2, !prof !2

				if.end.2:
				call void @e()
				br label %if.cond

				exit:
				ret void
				}

				declare zeroext i1 @a()
				declare void @b()
				declare void @c()
				declare void @d()
				declare void @e()

				!1 = !{!"branch_weights", i32 10, i32 1}
				!2 = !{!"branch_weights", i32 100, i32 1}
				!3 = !{!"branch_weights", i32 1, i32 100}
				!4 = !{!"branch_weights", i32 1, i32 1}

This is an archive of the discontinued LLVM Phabricator instance.

Improvement on computing edge probabilities when choosing the best successor in machine block placement.
ClosedPublic

Details

Diff Detail

Event Timeline

Revision Contents

Diff 40459

llvm/trunk/lib/CodeGen/MachineBlockPlacement.cpp

llvm/trunk/test/CodeGen/X86/code_placement_ignore_succ_in_inner_loop.ll

This is an archive of the discontinued LLVM Phabricator instance.

Improvement on computing edge probabilities when choosing the best successor in machine block placement. ClosedPublic

Details

Diff Detail

Event Timeline

Revision Contents

Diff 40459

llvm/trunk/lib/CodeGen/MachineBlockPlacement.cpp

llvm/trunk/test/CodeGen/X86/code_placement_ignore_succ_in_inner_loop.ll

Improvement on computing edge probabilities when choosing the best successor in machine block placement.
ClosedPublic