This is an archive of the discontinued LLVM Phabricator instance.

Update loop branch_weight metadata after loop rotation.
Needs Review, Public

Authored by trentxintong on Jan 11 2017, 7:24 PM.

Details

Reviewers

anemet
mzolotukhin
davidxl
danielcdh
mkuper
hfinkel

Summary

Update loop branch_weight metadata after loop rotation.

If the unrotated loop header carries branch_weight metadata, we update
it after rotation. More specifically, we update the branch in the guard
block and the branch in the rotated latch block.

Diff Detail

Build Status

Buildable 6075
Build 6075: arc lint + arc unit

Event Timeline

trentxintong updated this revision to Diff 84068.Jan 11 2017, 7:24 PM

trentxintong retitled this revision from to Update loop branch_weight metadata after loop rotation..

trentxintong updated this object.

trentxintong added reviewers: mkuper, davidxl, danielcdh, hfinkel, mzolotukhin, anemet.

trentxintong added a subscriber: llvm-commits.

twoh added a subscriber: twoh.Jan 11 2017, 10:26 PM

danielcdh added inline comments.Jan 12 2017, 9:04 AM

test/Transforms/LoopRotate/loop-rotate-pgo.ll
61	I'm a little confused by the logic here. For the following code:

  L1:
  if (cond) { // taken probability 7/35
    stmt;
    goto L1;
  }

Loop rotation originally transforms it to:

  if (cond) { // taken probability 7/35
  L1:
    stmt;
    if (cond) { // taken probability 7/35
      goto L1;
    }
  }

Your patch changes it to:

  if (cond) { // taken probability 7/28
  L1:
    stmt
    if (cond) { // taken probability 1/7
      goto L1;
    }
  }

Assuming "cond" has a fixed taken probability, the original probability seems more correct to me? Or am I missing something?

@danielcdh It's true that cond itself is not changed, but the inputs to the condition have changed. Before, you were testing the index variable in the loop header; now you have 2 conditions, and the SSA values used in them have changed, so the probability needs to be adjusted. I will explain more when I get to the office.

After we rotate the loop, the comparison in the old header is duplicated into the guard block and the header moves to the end of the loop. So we have 2 branches that carry branch_weight metadata, and both need to be adjusted. With this patch, we only do the adjustment for loops with exactly 1 exiting block; if the loop has early exits, it is harder to adjust the branch weights properly. E.g. if we have an early exit in the loop, we cannot tell whether the latch block is ever reached after rotation, let alone adjust its branch weight (we would need to look at the branch weights of the early exits to do this properly).

Two quantities stay constant across rotation: (1) the # of times the loop body is executed and (2) the # of times the exit block is executed (these are the values we get by extracting the prof metadata from the header branch before rotation). The backedge weight is simply the # of times the loop body is executed minus the # of times the loop exit is executed. With the loop exit count, loop body weight and loop backedge weight, we can compute the branch weights for the guard block.

This is true iff the loop executes at least one iteration every time it is reached. If the loop body seldom executes, we cannot do this (the # of times the loop body is executed minus the # of times the loop exit is executed gives a negative backedge weight). In this case, we cap the backedge weight at 1 (it should be 0). With this information we can compute the loop exit weight and all the other weights.

That's roughly what the patch is doing and how the numbers in the test case are computed.
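The bookkeeping described above can be sketched as plain arithmetic. This is only an illustration and not the code in the patch: `RotatedWeights`, `rotateWeights`, and the explicit `Skipped` parameter (the number of times the loop is bypassed without running an iteration) are hypothetical names. The sketch assumes a single exiting block and keeps the body and exit counts fixed across rotation.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical container for the four edge weights of a rotated loop.
struct RotatedWeights {
  uint64_t GuardEnter;    // guard block -> loop preheader
  uint64_t GuardSkip;     // guard block -> exit (zero iterations)
  uint64_t LatchBackedge; // latch -> header
  uint64_t LatchExit;     // latch -> exit
};

// Body and Exit are the weights read from the unrotated header branch.
// Skipped is an assumed count of executions that run zero iterations.
// Returns false when the inputs cannot describe a consistent profile.
bool rotateWeights(uint64_t Body, uint64_t Exit, uint64_t Skipped,
                   RotatedWeights &Out) {
  if (Skipped > Exit)
    return false; // cannot skip more often than the loop exits
  // Every execution that enters the loop leaves it once through the latch.
  uint64_t Entered = Exit - Skipped;
  if (Body < Entered)
    return false; // fewer body executions than loop entries
  // One body execution per entry; the remaining ones arrive by backedge.
  uint64_t Backedge = Body - Entered;
  assert(Entered + Backedge == Body && "body count must be preserved");
  Out = RotatedWeights{Entered, Skipped, Backedge, Entered};
  return true;
}
```

With `Body = 100`, `Exit = 10` and `Skipped = 0`, the backedge weight comes out as 90, which is the "body minus exit" rule from the comment. With `Body = 7`, `Exit = 28` and `Skipped = 0` the function returns false, which corresponds to the negative backedge weight case.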

mkuper added inline comments.Jan 12 2017, 12:59 PM

lib/Transforms/Scalar/LoopRotation.cpp
527	Also, regardless of the rest of the discussion - I don't think we should drop the metadata on the floor if we fail. I don't think "No data is better than imprecise data" is right in the general case, but that's arguable. Specifically here, though, we're imprecise even if updateLoopEstimatedBranchWeight() succeeds, because of the assumptions we make on the distribution.

trentxintong added inline comments.Jan 12 2017, 1:29 PM

lib/Transforms/Scalar/LoopRotation.cpp
527	I am fine changing this, i.e. keeping the metadata when we can't update it properly. I think it would still be useful (not counter-productive) in this case, even though imprecise.

Address suggestions.

Hopefully remove some unrelated changes to some unrelated files.

I only want the head commit to be included in this patch.

mkuper mentioned this in D28460: getLoopEstimatedTripCount should really be called getLoopEstimatedBackedgeTakeCount..Jan 12 2017, 9:32 PM

I want to merge https://reviews.llvm.org/D28460, but I do not know how to do
so on Phabricator. So I merged it manually =).

Thanks a lot for the example, I'll look into it.

Dehao

Looks like this patch will make the "always call" worse:

Without this patch:

pushq   %rbx
movq    %rdi, %rbx
cmpl    $0, (%rbx)
jne     .LBB1_3

.LBB1_1: # =>This Inner Loop Header: Depth=1

movq    %rbx, %rdi
callq   call_me
cmpl    $0, (%rbx)
je      .LBB1_1

.LBB1_3:

popq    %rbx
retq

With this patch:

pushq   %rbx
movq    %rdi, %rbx
cmpl    $0, (%rbx)
je      .LBB1_1

.LBB1_3:

popq    %rbx
retq

.LBB1_1: # =>This Inner Loop Header: Depth=1

movq    %rbx, %rdi
callq   call_me
cmpl    $0, (%rbx)
jne     .LBB1_3
jmp     .LBB1_1

As the trip count of this loop is always 1, the first code will have no taken branches, while with this patch, it will have 2 taken branches.

Guard block branches to preheader instead of header.

Talked with Xin offline:

1. The comparison I made is (clean client vs. entire patch), while Xin is comparing (D28460 vs. entire patch).
2. We both agree that the patch works fine for cases when avg trip count >= 0.5.
3. For cases when avg trip count < 0.5:
   3.1 both have good code layout, and the probabilities set for the guard branch are both reasonable
   3.2 this patch will peel the loop by 1, while no_patch will not peel the loop, so the patch has code size overhead
   3.3 the peeled loop may have better performance on the cold path when the loop trip count is unevenly distributed (but I have started to doubt this now)

If 3.3 is true, then the question becomes whether we want to trade code size for performance on the cold path. Xin is going to run some benchmarks with and without the test to measure the code size and performance impact of this patch and report back.

mkuper mentioned this in D27734: [LoopUnroll] Enable PGO-based loop peeling by default.Jan 13 2017, 3:30 PM

I am still in the middle of getting a machine on which I can do performance runs. The machine I have is not very stable, i.e. the specrun numbers fluctuate from run to run.

davidxl added inline comments.Feb 1 2017, 2:47 PM

lib/Transforms/Scalar/LoopRotation.cpp
88	name the parameters.
525	Remove the comment about OrigHeader BR. After block merging, it will become the latch block branch instruction. Is it better to call this after MergeBlockIntoPredecessor, when loop L is in a more consistent state?
550	exitting --> exiting
575	you need a test case to cover LoopBodyWeight > LoopExitWeight case --- not the one with known tripcount. For the constant trip count case, the guard BR gets eliminated.
614	Explain this condition? Perhaps a test case?
617	This is probably not quite correct. For the case when LoopBodyWeight > LoopExitWeight, especially when the runtime trip count is large, the guard BR should be heavily biased against exiting. I think it is reasonable to use an arbitrary biased exit probability such as 1/100. For cases when LoopBodyWeight and LoopExitWeight are close (e.g., trip count < 5), it is reasonable to let GuardBR keep the same exit branch probability as the new LatchBR. This applies to the LoopBodyWeight < LoopExitWeight case as well.
test/Transforms/LoopRotate/loop-rotate-pgo.ll
10	Perhaps name it 'loop_with_known_trip_count'
17	This loop has statically known trip count. The profile data does not match it. Use a different meta data.
33	add comment the loop body is by-passed most of the time.
37	explicitly check br instruction as well so that two branch targets are listed. !prof is meaningless without looking at the branch
51	Same here -- also check branch instruction

I am very sorry I have not put up the numbers for SPEC CPU2006. I am stuck in the middle of a few things and I will put them up as soon as I have them. I will address the comments too.

Address davidxl's comments.

I reworked how !prof metadata is computed after loop rotation.

One test case in peel-loop-pgo.ll changes. That's because I corrected how the
estimated loop trip count is computed. I think it's a good idea to fix the loop rotation
metadata and getEstimatedTripCount together, as getEstimatedTripCount uses
!prof metadata, which is now computed in a new way. However, the drawback is that
I need to modify a test case unrelated to loop rotation.

I ran the C/C++ benchmarks in CPU2006 with the current state of the patch (the baseline is without the metadata adjustment; a negative percentage means the benchmark runs slower or the code size becomes smaller after the patch). It seems the regressions in 429 and 444 are real. We also have a code size reduction of 2.41% in 401. Overall, we have more performance regressions after we adjust the metadata this way.

Benchmark	Perf		CodeSize
400		-0.22%		0.00%
401		0.76%		-2.41%
403		-0.51%		0.07%
429		-2.09%		0.32%
445		0.64%		-0.08%
456		-0.29%		0.01%
458		0.00%		-0.04%
462		-0.15%		0.00%
464		0.16%		0.00%
471		-0.38%		-0.02%
473		0.00%		-0.21%
483		-0.40%		-0.04%
433		0.73%		0.03%
444		-2.15%		-0.26%
447		-0.17%		-0.06%
450		0.00%		-0.05%
453		0.38%		-0.01%
470		0.12%		0.00%
482		0.18%		0.14%

davidxl added inline comments.Feb 21 2017, 2:33 PM

lib/Transforms/Scalar/LoopRotation.cpp
591	What if the average trip count is > 20? In that case, 0.05 is larger than the latch br exit probability, which is not right. The 'min' needs to be taken. It is also better to use fixed point operations (with the BranchProbability and BlockFrequency classes):

  BranchProbability GuardBP(5, 100);
  uint64_t NotTakenOnceWeight = (BlockFrequency(LoopExitWeight) * GuardBP).getFrequency();
  uint64_t LoopPHWeight = LoopExitWeight - NotTakenOnceWeight;
  GuardBR->setMetadata(MD_prof, ...);

'5' also needs to be a parameter. Do not use a hard-coded number.
623	same here, 0.95 may be smaller than the original Header BR's exit branch probability. The max should be taken. The value 0.5 is not necessary -- why not just directly use original header BR's exit BP?
636	This can be done by computing the right notTakenOnceRatio above.
649	The weight computation code above should be refactored and shared across different cases. The difference only lies in the way NotTakenOnceRatio is computed.
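The fixed-point suggestion in the comment on line 591 can be sketched without any LLVM dependency. This is a rough illustration under stated assumptions, not the patch and not LLVM's `BranchProbability` implementation: `Ratio`, `scaleWeight`, `GuardSplit` and `splitExitWeight` are hypothetical names, and the cap uses the original header branch's exit probability, `Exit / (Body + Exit)`, as one possible reading of "the latch br exit probability".

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a probability expressed as Num/Den.
struct Ratio {
  uint32_t Num;
  uint32_t Den;
};

// Scale Weight by P with integer arithmetic only, rounding to nearest.
// Weight is assumed to fit in 32 bits, so the product cannot overflow.
uint64_t scaleWeight(uint64_t Weight, Ratio P) {
  assert(P.Den != 0 && P.Num <= P.Den && "not a probability");
  assert(Weight <= UINT32_MAX && "branch weights are 32-bit values");
  return (Weight * P.Num + P.Den / 2) / P.Den;
}

struct GuardSplit {
  uint64_t NotTakenOnce; // guard -> exit
  uint64_t Preheader;    // guard -> loop preheader
};

// Split the exit weight using a configurable guard exit probability,
// capped by the exit probability of the original header branch.
GuardSplit splitExitWeight(uint64_t Body, uint64_t Exit, Ratio GuardExit) {
  assert(Body <= UINT32_MAX && Exit <= UINT32_MAX && "32-bit weights");
  uint64_t Configured = scaleWeight(Exit, GuardExit);
  uint64_t Total = Body + Exit;
  // Exit * (Exit / Total), rounded to nearest; zero for an empty profile.
  uint64_t Capped = Total ? (Exit * Exit + Total / 2) / Total : 0;
  uint64_t NotTakenOnce = std::min(Configured, Capped);
  return {NotTakenOnce, Exit - NotTakenOnce};
}
```

Passing the probability as an argument, rather than writing `5` inline, mirrors the request to make the number a parameter.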

Sorry for the really long hiatus. I am getting back to this one.

Rework the reworked metadata update in loop rotation. I still have some hardcoded numbers;
I do not know whether to turn them into options.

At least, we should try to agree on the mechanism itself and start collecting some numbers.

Harbormaster completed remote builds in B6070: Diff 97497.May 2 2017, 1:58 PM

Add asserts and comments.

Incorrectly used max in place of min.

Harbormaster completed remote builds in B6075: Diff 97505.May 2 2017, 2:52 PM

davidxl added inline comments.May 3 2017, 12:22 PM

lib/Transforms/Scalar/LoopRotation.cpp
561	maximum--> minimum
578	Avoid using floating point operation here. Just do: if (LoopBodyWeight > 20 * LoopExitWeight) .. Branch Weight Data is actually 32bit, so there is no risk of overflow here.
584	When ACT < 0.2, it seems simpler just do NotTakenOnceProb = (ExitWeight - BodyWeight)/ExitWeight.
599	This adjustment will create weird results: for ACT between 1 and 5, the computed NotTakenOnce prob is kept, which is 50%. However, when ACT is between 0.5 and 1, the resulting NotTakenOnce prob is smaller than 50%, which contradicts the trend: the smaller ACT is, the more likely the loop is not executed once. To fix this, do the following:

  When ACT is between 0.5 and 5, use 50%.
  When ACT is less than 0.5, use (ExitWeight - BodyWeight)/ExitWeight.

This makes the weight 'continuous', and the special handling using 'min' can be removed.
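The two suggestions above (integer comparisons instead of floating point, and a continuous rule below an average trip count of 0.5) can be combined into a small sketch. This is an assumption-laden illustration rather than the revised patch: `notTakenOnceWeight` is a hypothetical name, the 20x and 5x tiers with their 1% and 5% values are taken from the diff under review, and the two lower tiers follow the comment on line 599.

```cpp
#include <cassert>
#include <cstdint>

// Estimate how many executions leave the loop without running a single
// iteration, given the body and exit weights of the unrotated header.
// Only integer arithmetic is used; both weights are assumed to fit in
// 32 bits, so 20 * Exit cannot overflow a uint64_t.
uint64_t notTakenOnceWeight(uint64_t Body, uint64_t Exit) {
  assert(Body <= UINT32_MAX && Exit <= UINT32_MAX && "32-bit weights");
  if (Body > 20 * Exit) // average trip count above 20: use 1%
    return Exit / 100;
  if (Body > 5 * Exit) // average trip count above 5: use 5%
    return Exit * 5 / 100;
  if (2 * Body >= Exit) // average trip count between 0.5 and 5: use 50%
    return Exit / 2;
  // Average trip count below 0.5: (Exit - Body) / Exit of the exit weight.
  return Exit - Body;
}
```

At an average trip count of exactly 0.5, `Exit - Body` equals `Exit / 2`, so the last two tiers meet without a jump, which is the continuity the comment asks for.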

Revision Contents

Path	Size
lib/Transforms/Scalar/LoopRotation.cpp	167 lines
lib/Transforms/Utils/LoopUtils.cpp	6 lines
test/Transforms/LoopRotate/loop-rotate-pgo.ll	154 lines
test/Transforms/LoopUnroll/peel-loop-pgo.ll	16 lines

Diff 97505

lib/Transforms/Scalar/LoopRotation.cpp

Show All 22 Lines
#include "llvm/Analysis/ScalarEvolution.h"		#include "llvm/Analysis/ScalarEvolution.h"
#include "llvm/Analysis/ScalarEvolutionAliasAnalysis.h"		#include "llvm/Analysis/ScalarEvolutionAliasAnalysis.h"
#include "llvm/Analysis/TargetTransformInfo.h"		#include "llvm/Analysis/TargetTransformInfo.h"
#include "llvm/Analysis/ValueTracking.h"		#include "llvm/Analysis/ValueTracking.h"
#include "llvm/IR/CFG.h"		#include "llvm/IR/CFG.h"
#include "llvm/IR/Dominators.h"		#include "llvm/IR/Dominators.h"
#include "llvm/IR/Function.h"		#include "llvm/IR/Function.h"
#include "llvm/IR/IntrinsicInst.h"		#include "llvm/IR/IntrinsicInst.h"
		#include "llvm/IR/MDBuilder.h"
#include "llvm/IR/Module.h"		#include "llvm/IR/Module.h"
		#include "llvm/Support/BlockFrequency.h"
		#include "llvm/Support/BranchProbability.h"
#include "llvm/Support/CommandLine.h"		#include "llvm/Support/CommandLine.h"
#include "llvm/Support/Debug.h"		#include "llvm/Support/Debug.h"
#include "llvm/Support/raw_ostream.h"		#include "llvm/Support/raw_ostream.h"
#include "llvm/Transforms/Scalar.h"		#include "llvm/Transforms/Scalar.h"
#include "llvm/Transforms/Scalar/LoopPassManager.h"		#include "llvm/Transforms/Scalar/LoopPassManager.h"
#include "llvm/Transforms/Utils/BasicBlockUtils.h"		#include "llvm/Transforms/Utils/BasicBlockUtils.h"
#include "llvm/Transforms/Utils/Local.h"		#include "llvm/Transforms/Utils/Local.h"
#include "llvm/Transforms/Utils/LoopUtils.h"		#include "llvm/Transforms/Utils/LoopUtils.h"
#include "llvm/Transforms/Utils/SSAUpdater.h"		#include "llvm/Transforms/Utils/SSAUpdater.h"
#include "llvm/Transforms/Utils/ValueMapper.h"		#include "llvm/Transforms/Utils/ValueMapper.h"
		#include <algorithm> /// std::max
using namespace llvm;		using namespace llvm;

#define DEBUG_TYPE "loop-rotate"		#define DEBUG_TYPE "loop-rotate"

static cl::opt<unsigned> DefaultRotationThreshold(		static cl::opt<unsigned> DefaultRotationThreshold(
"rotation-max-header-size", cl::init(16), cl::Hidden,		"rotation-max-header-size", cl::init(16), cl::Hidden,
cl::desc("The default maximum header size for automatic loop rotation"));		cl::desc("The default maximum header size for automatic loop rotation"));

Show All 16 Lines	LoopRotate(unsigned MaxHeaderSize, LoopInfo *LI,
DominatorTree *DT, ScalarEvolution *SE, const SimplifyQuery &SQ)		DominatorTree *DT, ScalarEvolution *SE, const SimplifyQuery &SQ)
: MaxHeaderSize(MaxHeaderSize), LI(LI), TTI(TTI), AC(AC), DT(DT), SE(SE),		: MaxHeaderSize(MaxHeaderSize), LI(LI), TTI(TTI), AC(AC), DT(DT), SE(SE),
SQ(SQ) {}		SQ(SQ) {}
bool processLoop(Loop *L);		bool processLoop(Loop *L);

private:		private:
bool rotateLoop(Loop *L, bool SimplifiedLatch);		bool rotateLoop(Loop *L, bool SimplifiedLatch);
bool simplifyLoopLatch(Loop *L);		bool simplifyLoopLatch(Loop *L);

		/// Update the branch_weight metadata for the rotated loop latch and guard
		/// block.
		///
		/// NOTE: this function assumes that the tripcounts every time the loop is
		/// executed are similar or bell curve like. PGO does not provide us with a
		/// distribution of the tripcounts of the loop, it merely gives a summation
		/// of all the times the branch is taken or not taken when the loop is hit.
		/// In case the loop has a more skewed tripcount distribution we could end
		/// up underestimating the tripcount each time the loop is executed.
		void updateLoopEstimatedBranchWeight(Loop *L, BranchInst *GuardBR,
		davidxl: name the parameters.
		BranchInst *LatchBR);
};		};
} // end anonymous namespace		} // end anonymous namespace

/// RewriteUsesOfClonedInstructions - We just cloned the instructions from the		/// RewriteUsesOfClonedInstructions - We just cloned the instructions from the
/// old header into the preheader. If there were uses of the values produced by		/// old header into the preheader. If there were uses of the values produced by
/// these instruction that were outside of the loop, we have to insert PHI nodes		/// these instruction that were outside of the loop, we have to insert PHI nodes
/// to merge the two values. Do this now.		/// to merge the two values. Do this now.
static void RewriteUsesOfClonedInstructions(BasicBlock *OrigHeader,		static void RewriteUsesOfClonedInstructions(BasicBlock *OrigHeader,
▲ Show 20 Lines • Show All 316 Lines • ▼ Show 20 Lines	bool LoopRotate::rotateLoop(Loop *L, bool SimplifiedLatch) {
// At this point, we've finished our major CFG changes. As part of cloning		// At this point, we've finished our major CFG changes. As part of cloning
// the loop into the preheader we've simplified instructions and the		// the loop into the preheader we've simplified instructions and the
// duplicated conditional branch may now be branching on a constant. If it is		// duplicated conditional branch may now be branching on a constant. If it is
// branching on a constant and if that constant means that we enter the loop,		// branching on a constant and if that constant means that we enter the loop,
// then we fold away the cond branch to an uncond branch. This simplifies the		// then we fold away the cond branch to an uncond branch. This simplifies the
// loop in cases important for nested loops, and it also means we don't have		// loop in cases important for nested loops, and it also means we don't have
// to split as many edges.		// to split as many edges.
BranchInst *PHBI = cast<BranchInst>(OrigPreheader->getTerminator());		BranchInst *PHBI = cast<BranchInst>(OrigPreheader->getTerminator());
		BranchInst *GBI = PHBI;
assert(PHBI->isConditional() && "Should be clone of BI condbr!");		assert(PHBI->isConditional() && "Should be clone of BI condbr!");
if (!isa<ConstantInt>(PHBI->getCondition()) ||		if (!isa<ConstantInt>(PHBI->getCondition()) ||
PHBI->getSuccessor(cast<ConstantInt>(PHBI->getCondition())->isZero()) !=		PHBI->getSuccessor(cast<ConstantInt>(PHBI->getCondition())->isZero()) !=
NewHeader) {		NewHeader) {
// The conditional branch can't be folded, handle the general case.		// The conditional branch can't be folded, handle the general case.
// Update DominatorTree to reflect the CFG change we just made. Then split		// Update DominatorTree to reflect the CFG change we just made. Then split
// edges as necessary to preserve LoopSimplify form.		// edges as necessary to preserve LoopSimplify form.
if (DT) {		if (DT) {
▲ Show 20 Lines • Show All 46 Lines • ▼ Show 20 Lines	assert(SplitLatchEdge &&
"Despite splitting all preds, failed to split latch exit?");		"Despite splitting all preds, failed to split latch exit?");
} else {		} else {
// We can fold the conditional branch in the preheader, this makes things		// We can fold the conditional branch in the preheader, this makes things
// simpler. The first step is to remove the extra edge to the Exit block.		// simpler. The first step is to remove the extra edge to the Exit block.
Exit->removePredecessor(OrigPreheader, true /*preserve LCSSA*/);		Exit->removePredecessor(OrigPreheader, true /*preserve LCSSA*/);
BranchInst *NewBI = BranchInst::Create(NewHeader, PHBI);		BranchInst *NewBI = BranchInst::Create(NewHeader, PHBI);
NewBI->setDebugLoc(PHBI->getDebugLoc());		NewBI->setDebugLoc(PHBI->getDebugLoc());
PHBI->eraseFromParent();		PHBI->eraseFromParent();
		GBI = NewBI;

// With our CFG finalized, update DomTree if it is available.		// With our CFG finalized, update DomTree if it is available.
if (DT) {		if (DT) {
// Update OrigHeader to be dominated by the new header block.		// Update OrigHeader to be dominated by the new header block.
DT->changeImmediateDominator(NewHeader, OrigPreheader);		DT->changeImmediateDominator(NewHeader, OrigPreheader);
DT->changeImmediateDominator(OrigHeader, OrigLatch);		DT->changeImmediateDominator(OrigHeader, OrigLatch);

// Brute force incremental dominator tree update. Call		// Brute force incremental dominator tree update. Call
Show All 23 Lines	if (DT) {

// If the dominator changed, this may have an effect on other		// If the dominator changed, this may have an effect on other
// predecessors, continue until we reach a fixpoint.		// predecessors, continue until we reach a fixpoint.
} while (Changed);		} while (Changed);
}		}
}		}

assert(L->getLoopPreheader() && "Invalid loop preheader after loop rotation");		assert(L->getLoopPreheader() && "Invalid loop preheader after loop rotation");
assert(L->getLoopLatch() && "Invalid loop latch after loop rotation");		assert(L->getLoopLatch() && "Invalid loop latch after loop rotation");
		davidxl: Remove the comment about OrigHeader BR. After block merging, it will become the latch block branch instruction. Is it better to call this after MergeBlockIntoPredecessor, when loop L is in a more consistent state?

// Now that the CFG and DomTree are in a consistent state again, try to merge		// Now that the CFG and DomTree are in a consistent state again, try to merge
		mkuper: Also, regardless of the rest of the discussion - I don't think we should drop the metadata on the floor if we fail. I don't think "No data is better than imprecise data" is right in the general case, but that's arguable. Specifically here, though, we're imprecise even if updateLoopEstimatedBranchWeight() succeeds, because of the assumptions we make on the distribution.
		trentxintong (author): I am fine changing this, i.e. keeping the metadata when we can't update it properly. I think it would still be useful (not counter-productive) in this case, even though imprecise.
// the OrigHeader block into OrigLatch. This will succeed if they are		// the OrigHeader block into OrigLatch. This will succeed if they are
// connected by an unconditional branch. This is just a cleanup so the		// connected by an unconditional branch. This is just a cleanup so the
// emitted code isn't too gross in this common case.		// emitted code isn't too gross in this common case.
MergeBlockIntoPredecessor(OrigHeader, DT, LI);		MergeBlockIntoPredecessor(OrigHeader, DT, LI);

		// We now have done all the transformations on the CFG. Adjust the loop branch
		// metadata in case we have them.
		assert(L->getLoopLatch()->getTerminator() == BI &&
		"Unexpected latch terminator");
		updateLoopEstimatedBranchWeight(L, GBI, BI);

DEBUG(dbgs() << "LoopRotation: into "; L->dump());		DEBUG(dbgs() << "LoopRotation: into "; L->dump());

++NumRotated;		++NumRotated;
return true;		return true;
}		}

		/// Based on the loop body weight and the loop exit weight, determine the # of
		/// times the loop exits without a single iteration.
		static uint64_t computeLoopNotTakenOnceWeight(uint64_t LoopBodyWeight,
		uint64_t LoopExitWeight) {
		unsigned NotTakenOncePercent = 0;
		double ATC = (double)LoopBodyWeight / LoopExitWeight;
		davidxl: exitting --> exiting
		// At this point, we have the # of times the loop body and loop exit execute.
		// Based on this information, we compute the # of times the loop exits without
		// a single iteration, i.e. not-taken-once-weight.
		//
		// NOTE: ATC does not directly tell us how many times the loop exits without
		// executing a single iteration.
		//
		// We define the following discrete average-trip-count (ATC) thresholds:
		//
		// 1. Loop with really large ATC (20+). This loop probably always executes
		// a few iterations every time its hit. We give it maximum chance to exit
		davidxl: maximum--> minimum
		// without a single iteration.
		//
		// 2. Loop with large ATC (5+). We give it 5% chance to exit without a single
		// iteration. This is because everytime the loop is hit, it likely will branch
		// to the loop body.
		//
		// 3. Loop with ATC (~1+-). We give it a 50% chance to exit. This loop may
		// sometimes exit without a single iteration.
		//
		// 5. Loop with small ATC (~1/5). We give it 95% chance to exit, this
		// is because everytime the loop is hit, it most likely will branch
		// to the exit without executing a single iteration.
		//
		// 6. Loop with really small ATC (1/20-). We give it maximum chance to exit
		davidxl: you need a test case to cover LoopBodyWeight > LoopExitWeight case --- not the one with known tripcount. For the constant trip count case, the guard BR gets eliminated.
		// without a single iteration. This loop probably seldom executes everytime
		// it is hit.
		if (ATC > 20)
		davidxl: Avoid using floating point operation here. Just do: if (LoopBodyWeight > 20 * LoopExitWeight) .. Branch Weight Data is actually 32bit, so there is no risk of overflow here.
		NotTakenOncePercent = 1;
		else if (ATC > 5)
		NotTakenOncePercent = 5;
		else if (ATC > 0.2)
		NotTakenOncePercent = 50;
		else if (ATC > 0.05)
		davidxl: When ACT < 0.2, it seems simpler just do NotTakenOnceProb = (ExitWeight - BodyWeight)/ExitWeight.
		NotTakenOncePercent = 95;
		else
		NotTakenOncePercent = 1;

		// We now have the NotTakenOnce percentage, compute its value.
		BranchProbability GuardBP(NotTakenOncePercent, 100);
		uint64_t NotTakenOnce =
		davidxl: What if the average trip count is > 20? In that case, 0.05 is larger than the latch br exit probability, which is not right. The 'min' needs to be taken. It is also better to use fixed point operations (with the BranchProbability and BlockFrequency classes): BranchProbability GuardBP(5, 100); uint64_t NotTakenOnceWeight = (BlockFrequency(LoopExitWeight) * GuardBP).getFrequency(); uint64_t LoopPHWeight = LoopExitWeight - NotTakenOnceWeight; GuardBR->setMetadata(MD_prof, ...); '5' also needs to be a parameter. Do not use a hard-coded number.
		(BlockFrequency(LoopExitWeight) * GuardBP).getFrequency();

		// In addition to the thresholds, we need to make sure NotTakenOnce is at
		// least as large as its minimum, i.e. LoopExitWeight - LoopBodyWeight, in
		// case LoopExitWeight > LoopBodyWeight.
		if (LoopExitWeight < LoopBodyWeight)
		return NotTakenOnce;
		return std::min(LoopExitWeight - LoopBodyWeight, NotTakenOnce);
		davidxl: This adjustment will create weird results: for ACT between 1 and 5, the computed NotTakenOnce prob is kept, which is 50%. However, when ACT is between 0.5 and 1, the resulting NotTakenOnce prob is smaller than 50%, which contradicts the trend: the smaller ACT is, the more likely the loop is not executed once. To fix this, do the following: when ACT is between 0.5 and 5, use 50%; when ACT is less than 0.5, use (ExitWeight - BodyWeight)/ExitWeight. This makes the weight 'continuous', and the special handling using 'min' can be removed.
		}

		/// We have now rotated the loop, update branch weight prof metadata in case
		/// we have it.
		void LoopRotate::updateLoopEstimatedBranchWeight(Loop *L, BranchInst *GuardBR,
		BranchInst *LatchBR) {
		// If loop has multiple exiting blocks, we need to look through each one
		// of them and reason about their profile metadata to compute the rotated
		// loop metadata correctly. e.g. if we have an early exit in the loop,
		// we will not be able to tell whether the latch block will ever be
		// reached after rotation, i.e. early exit could have been taken.
		// we bail for now.
		//
		// FIXME: Handle loops with early exits.
		if (!L->getExitingBlock())
		davidxl: Explain this condition? Perhaps a test case?
		return;

		// Check whether there is loop !prof metadata.
		davidxl: This is probably not quite correct. For the case when LoopBodyWeight > LoopExitWeight, especially when the runtime trip count is large, the guard BR should be heavily biased against exiting. I think it is reasonable to use an arbitrary biased exit probability such as 1/100. For cases when LoopBodyWeight and LoopExitWeight are close (e.g., trip count < 5), it is reasonable to let GuardBR keep the same exit branch probability as the new LatchBR. This applies to the LoopBodyWeight < LoopExitWeight case as well.
		uint64_t LoopBodyWeight, LoopExitWeight;
		if (!LatchBR->extractProfMetadata(LoopBodyWeight, LoopExitWeight))
		return;

		// At this point, we know we have a branch weight metadata, recompute it.
		//
		davidxl: same here, 0.95 may be smaller than the original Header BR's exit branch probability. The max should be taken. The value 0.5 is not necessary -- why not just directly use original header BR's exit BP?
		// Update latch branch metadata. After the loop is rotated, the # of times
		// the loop body is executed remain the same and the # of times the exit
		// block is executed remain the same as well. We use these information to
		// compute the weight for the rest of the branches.
		//
		// Make sure the loop body and exit weight gets its correct value.
		bool LoopBodyOnTrue = LatchBR->getSuccessor(0) == L->getHeader();
		if (!LoopBodyOnTrue)
		std::swap(LoopBodyWeight, LoopExitWeight);

		// Once we know the # of times the loop exits without a single iteration,
		// we can compute other weights after rotating the loop.
		uint64_t NotTakenOnceWeight =
		davidxl: This can be done by computing the right notTakenOnceRatio above.
		computeLoopNotTakenOnceWeight(LoopBodyWeight, LoopExitWeight);

		// Make sure NotTakenOnceWeight is always no larger than LoopExitWeight.
		assert(NotTakenOnceWeight <= LoopExitWeight && "Invalid LoopExitWeight");
		// # of times loop is entered is the same as # of times the latch branch
		// goes to the exit block.
		uint64_t LoopPHWeight = LoopExitWeight - NotTakenOnceWeight;

		// Backedge weight is the # of times the loop body is executed - # of times
		// loop body is entered.
		assert(LoopBodyWeight >= LoopPHWeight && "Invalid LoopPHWeight");
		uint64_t BackedgeWeight = LoopBodyWeight - LoopPHWeight;

		davidxl: The weight computation code above should be refactored and shared across different cases. The difference only lies in the way NotTakenOnceRatio is computed.
		// We finished computing the edge weights, normalize them a bit to
		// make sure every edge weight is at least as big as the minimum
		// edge weight.
		const uint64_t MinimumEdgeWeight = 1;
		LoopPHWeight = std::max(LoopPHWeight, MinimumEdgeWeight);
		NotTakenOnceWeight = std::max(NotTakenOnceWeight, MinimumEdgeWeight);
		BackedgeWeight = std::max(BackedgeWeight, MinimumEdgeWeight);

		// Sanity check ... no branch weight should be smaller than 1.
		assert(BackedgeWeight >= MinimumEdgeWeight &&
		       LoopExitWeight >= MinimumEdgeWeight &&
		       NotTakenOnceWeight >= MinimumEdgeWeight &&
		       LoopPHWeight >= MinimumEdgeWeight &&
		       "Invalid branch weight detected!");

		MDBuilder MDB(LatchBR->getFunction()->getContext());
		// Update the latch branch metadata. The loop preheader weight is the
		// same as the loop exit weight.
		LatchBR->setMetadata(
		    LLVMContext::MD_prof,
		    LoopBodyOnTrue ? MDB.createBranchWeights(BackedgeWeight, LoopPHWeight)
		                   : MDB.createBranchWeights(LoopPHWeight, BackedgeWeight));
		// Update the guard branch metadata.
		if (GuardBR->isUnconditional()) {
		  // Guard branch has been simplified to an unconditional. It carries no
		  // branch metadata.
		  GuardBR->setMetadata(LLVMContext::MD_prof, nullptr);
		} else {
		  GuardBR->setMetadata(
		      LLVMContext::MD_prof,
		      // Make sure we know where guard branch goes to on
		      // true.
		      GuardBR->getSuccessor(0) == L->getLoopPreheader()
		          ? MDB.createBranchWeights(LoopPHWeight, NotTakenOnceWeight)
		          : MDB.createBranchWeights(NotTakenOnceWeight, LoopPHWeight));
		}
		}
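
As a rough illustration of the arithmetic in the hunk above, the following standalone sketch recomputes the three edge weights from plain integers. The struct `RotatedWeights` and the function `computeRotatedWeights` are hypothetical names introduced only for this example, and the not-taken-once weight is passed in as a parameter because the patch's `computeLoopNotTakenOnceWeight` helper is not shown in this part of the diff.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical container for the edge weights derived after rotation.
struct RotatedWeights {
  uint64_t LoopPH;       // guard -> preheader (loop is entered)
  uint64_t NotTakenOnce; // guard -> exit (loop runs zero iterations)
  uint64_t Backedge;     // latch -> header
};

// Sketch only: NotTakenOnceWeight is an input here (an assumption), whereas
// the patch derives it from the original header branch weights.
RotatedWeights computeRotatedWeights(uint64_t LoopBodyWeight,
                                     uint64_t LoopExitWeight,
                                     uint64_t NotTakenOnceWeight) {
  // The loop is entered every time the exit is reached by way of the latch.
  assert(NotTakenOnceWeight <= LoopExitWeight && "Invalid LoopExitWeight");
  uint64_t LoopPHWeight = LoopExitWeight - NotTakenOnceWeight;

  // Body executions that did not come from the preheader came from the latch.
  assert(LoopBodyWeight >= LoopPHWeight && "Invalid LoopPHWeight");
  uint64_t BackedgeWeight = LoopBodyWeight - LoopPHWeight;

  // Clamp every weight to the minimum edge weight, as the patch does.
  const uint64_t MinimumEdgeWeight = 1;
  return {std::max(LoopPHWeight, MinimumEdgeWeight),
          std::max(NotTakenOnceWeight, MinimumEdgeWeight),
          std::max(BackedgeWeight, MinimumEdgeWeight)};
}
```

Under these assumptions, `computeRotatedWeights(35, 7, 0)` gives a preheader weight of 7, a backedge weight of 28, and a not-taken-once weight clamped up to 1, while `computeRotatedWeights(28, 7, 3)` gives 4, 24, and 3. With those inputs the results line up with the `!0` to `!3` weights that the new test file expects.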

/// Determine whether the instructions in this range may be safely and cheaply		/// Determine whether the instructions in this range may be safely and cheaply
/// speculated. This is not an important enough situation to develop complex		/// speculated. This is not an important enough situation to develop complex
/// heuristics. We handle a single arithmetic instruction along with any type		/// heuristics. We handle a single arithmetic instruction along with any type
/// conversions.		/// conversions.
static bool shouldSpeculateInstrs(BasicBlock::iterator Begin,		static bool shouldSpeculateInstrs(BasicBlock::iterator Begin,
BasicBlock::iterator End, Loop *L) {		BasicBlock::iterator End, Loop *L) {
bool seenIncrement = false;		bool seenIncrement = false;
bool MultiExitLoop = false;		bool MultiExitLoop = false;

lib/Transforms/Utils/LoopUtils.cpp

	Optional<unsigned> llvm::getLoopEstimatedTripCount(Loop *L) {
uint64_t TrueVal, FalseVal;		uint64_t TrueVal, FalseVal;
if (!LatchBR->extractProfMetadata(TrueVal, FalseVal))		if (!LatchBR->extractProfMetadata(TrueVal, FalseVal))
return None;		return None;

if (!TrueVal || !FalseVal)		if (!TrueVal || !FalseVal)
return 0;		return 0;

// Divide the count of the backedge by the count of the edge exiting the loop,		// Divide the count of the backedge by the count of the edge exiting the loop,
// rounding to nearest.		// rounding to nearest. Trip count == backedge-taken count + 1.
if (LatchBR->getSuccessor(0) == L->getHeader())		if (LatchBR->getSuccessor(0) == L->getHeader())
return (TrueVal + (FalseVal / 2)) / FalseVal;		return (TrueVal + (FalseVal / 2)) / FalseVal + 1;
else		else
return (FalseVal + (TrueVal / 2)) / TrueVal;		return (FalseVal + (TrueVal / 2)) / TrueVal + 1;
}		}
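
The right-hand (new) side of this hunk can be read as "round the backedge-to-exit ratio to nearest, then add one". The sketch below restates that arithmetic on plain integers, as an illustration only. The function name `estimateTripCount` and the `HeaderOnTrue` flag are hypothetical stand-ins for the latch branch queries, and it returns 0 where the original returns a zero count for a missing weight.

```cpp
#include <cstdint>

// Sketch only: TrueVal/FalseVal are the latch branch weights, and
// HeaderOnTrue says whether the true successor is the loop header.
uint64_t estimateTripCount(uint64_t TrueVal, uint64_t FalseVal,
                           bool HeaderOnTrue) {
  if (!TrueVal || !FalseVal)
    return 0;

  uint64_t BackedgeWeight = HeaderOnTrue ? TrueVal : FalseVal;
  uint64_t ExitWeight = HeaderOnTrue ? FalseVal : TrueVal;

  // Backedge-taken count, rounded to nearest, plus one for the first trip.
  return (BackedgeWeight + ExitWeight / 2) / ExitWeight + 1;
}
```

For instance, with latch weights 3001 and 1001 the rounded ratio is 3, so this sketch returns 4, which is consistent with the peel count in `peel-loop-pgo.ll` moving from 3 to 4.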

test/Transforms/LoopRotate/loop-rotate-pgo.ll

This file was added.

				; RUN: opt -S -loop-rotate < %s | FileCheck %s
				target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"
				target triple = "x86_64-apple-macosx10.12.0"

				; This loop has a very large average trip count, so we bias the guard branch
				; heavily against exiting, i.e., only a 5% chance to exit.
				define i32 @loops_with_large_tripcount(i32 %n) {
				; CHECK: entry
				; CHECK: icmp slt i32 0, %n
				; CHECK: label %for.body.lr.ph, label %for.end, !prof !0
				davidxl: Perhaps name it 'loop_with_known_trip_count'
				entry:
				br label %for.cond

				for.cond:
				%index = phi i32 [ 0, %entry ], [ %inc, %for.inc ]
				%cmp = icmp slt i32 %index, %n
				br i1 %cmp, label %for.body, label %for.end, !prof !0
				davidxl: This loop has statically known trip count. The profile data does not match it. Use a different meta data.

				for.body:
				br label %for.inc

				; CHECK: for.inc
				; CHECK: icmp slt i32 %inc, %n
				; CHECK-NEXT: label %for.body, label %for.cond.for.end_crit_edge, !prof !1
				for.inc:
				%inc = add nsw i32 %index, 1
				br label %for.cond

				for.end:
				ret i32 0
				}

				; This loop has average trip count bigger than 1 but no bigger than 5, we give
				davidxl: add comment the loop body is by-passed most of the time.
				; the guard branch 50% chance to exit.
				define i32 @loops_with_moderately_large_tripcount(i32 %n) {
				; CHECK: entry
				; CHECK: icmp slt i32 0, %n
				davidxl: explicitly check br instruction as well so that two branch targets are listed. !prof is meaningless without looking at the branch
				; CHECK: label %for.body.lr.ph, label %for.end, !prof !2
				entry:
				br label %for.cond

				for.cond:
				%index = phi i32 [ 0, %entry ], [ %inc, %for.inc ]
				%cmp = icmp slt i32 %index, %n
				br i1 %cmp, label %for.body, label %for.end, !prof !1

				for.body:
				br label %for.inc

				; CHECK: for.inc
				; CHECK: icmp slt i32 %inc, %n
				davidxl: Same here -- also check branch instruction
				; CHECK-NEXT: label %for.body, label %for.cond.for.end_crit_edge, !prof !3
				for.inc:
				%inc = add nsw i32 %index, 1
				br label %for.cond

				for.end:
				ret i32 0
				}

				; This loop has a very small trip count, we bias the guard branch to have a
				danielcdh: I'm a little confused by the logic here. For the following code:
				  L1:
				  if (cond) { // taken probability 7/35
				    stmt;
				    goto L1;
				  }
				The loop rotate originally transforms it to:
				  if (cond) { // taken probability 7/35
				    L1:
				    stmt;
				    if (cond) { // taken probability 7/35
				      goto L1;
				    }
				  }
				Your patch changes it to:
				  if (cond) { // taken probability 7/28
				    L1:
				    stmt;
				    if (cond) { // taken probability 1/7
				      goto L1;
				    }
				  }
				Assuming "cond" has a fixed taken probability, the original probability seems more correct to me. Or am I missing something?
				; very high chance to exit - 95%.
				define i32 @loops_with_small_tripcount(i32 %a) {
				; CHECK: icmp slt i32 15, %a
				; CHECK-NEXT: label %for.end, label %for.body.lr.ph, !prof !4
				entry:
				br label %for.cond

				for.cond:
				%index = phi i32 [ 15, %entry ], [ %inc, %for.inc ]
				%cmp = icmp slt i32 %index, %a
				br i1 %cmp, label %for.end, label %for.body, !prof !0

				for.body:
				br label %for.inc

				; CHECK: for.inc
				; CHECK: icmp slt i32 %inc, %a
				; CHECK-NEXT: label %for.cond.for.end_crit_edge, label %for.body, !prof !5
				for.inc:
				%inc = add nsw i32 %index, 1
				br label %for.cond

				for.end:
				ret i32 0
				}

				; This loop has average trip count smaller than 1 but no smaller than 1/5, we give
				; the guard branch 50% chance to exit.
				define i32 @loops_with_moderately_small_tripcount(i32 %a) {
				; CHECK: icmp slt i32 15, %a
				; CHECK-NEXT: label %for.end, label %for.body.lr.ph, !prof !6
				entry:
				br label %for.cond

				for.cond:
				%index = phi i32 [ 15, %entry ], [ %inc, %for.inc ]
				%cmp = icmp slt i32 %index, %a
				br i1 %cmp, label %for.end, label %for.body, !prof !1

				for.body:
				br label %for.inc

				; CHECK: for.inc
				; CHECK: icmp slt i32 %inc, %a
				; CHECK-NEXT: label %for.cond.for.end_crit_edge, label %for.body, !prof !0
				for.inc:
				%inc = add nsw i32 %index, 1
				br label %for.cond

				for.end:
				ret i32 0
				}

				; This is a loop with a known trip count. Make sure the guard branch metadata
				; is deleted after the branch is simplified to an unconditional br.
				define i32 @loops_with_known_tripcount() {
				; Make sure entry has a branch with no metadata.
				; CHECK: entry
				; CHECK-NOT: !prof
				; CHECK: for.body:
				entry:
				br label %for.cond

				for.cond:
				%index = phi i32 [ 0, %entry ], [ %inc, %for.inc ]
				%cmp = icmp slt i32 %index, 4
				br i1 %cmp, label %for.body, label %for.end, !prof !1

				for.body:
				br label %for.inc

				; CHECK: for.inc
				; CHECK: icmp slt i32 %inc, 4
				; CHECK-NEXT: label %for.body, label %for.end, !prof !6
				for.inc:
				%inc = add nsw i32 %index, 1
				br label %for.cond

				for.end:
				ret i32 0
				}


				; CHECK: !0 = !{!"branch_weights", i32 7, i32 1}
				; CHECK: !1 = !{!"branch_weights", i32 28, i32 7}
				; CHECK: !2 = !{!"branch_weights", i32 4, i32 3}
				; CHECK: !3 = !{!"branch_weights", i32 24, i32 4}
				; CHECK: !4 = !{!"branch_weights", i32 33, i32 2}
				; CHECK: !5 = !{!"branch_weights", i32 2, i32 5}
				; CHECK: !6 = !{!"branch_weights", i32 21, i32 7}
				!0 = !{!"branch_weights", i32 35, i32 7}
				!1 = !{!"branch_weights", i32 28, i32 7}
				!2 = !{!"branch_weights", i32 4, i32 1}

test/Transforms/LoopUnroll/peel-loop-pgo.ll

	; RUN: opt < %s -S -debug-only=loop-unroll -loop-unroll 2>&1 | FileCheck %s			; RUN: opt < %s -S -debug-only=loop-unroll -loop-unroll 2>&1 | FileCheck %s
	; REQUIRES: asserts			; REQUIRES: asserts

	; Make sure we use the profile information correctly to peel-off 3 iterations			; Make sure we use the profile information correctly to peel-off 4 iterations
	; from the loop, and update the branch weights for the peeled loop properly.			; from the loop, and update the branch weights for the peeled loop properly.

	; CHECK: Loop Unroll: F[basic]			; CHECK: Loop Unroll: F[basic]
	; CHECK: PEELING loop %for.body with iteration count 3!			; CHECK: PEELING loop %for.body with iteration count 4!
	; CHECK: Loop Unroll: F[optsize]			; CHECK: Loop Unroll: F[optsize]
	; CHECK-NOT: PEELING			; CHECK-NOT: PEELING

	; CHECK-LABEL: @basic			; CHECK-LABEL: @basic
	; CHECK: br i1 %{{.}}, label %[[NEXT0:.]], label %for.cond.for.end_crit_edge, !prof !1			; CHECK: br i1 %{{.}}, label %[[NEXT0:.]], label %for.cond.for.end_crit_edge, !prof !1
	; CHECK: [[NEXT0]]:			; CHECK: [[NEXT0]]:
	; CHECK: br i1 %{{.}}, label %[[NEXT1:.]], label %for.cond.for.end_crit_edge, !prof !2			; CHECK: br i1 %{{.}}, label %[[NEXT1:.]], label %for.cond.for.end_crit_edge, !prof !2
	; CHECK: [[NEXT1]]:			; CHECK: [[NEXT1]]:
	; CHECK: br i1 %{{.}}, label %[[NEXT2:.]], label %for.cond.for.end_crit_edge, !prof !3			; CHECK: br i1 %{{.}}, label %[[NEXT2:.]], label %for.cond.for.end_crit_edge, !prof !3
	; CHECK: [[NEXT2]]:			; CHECK: [[NEXT2]]:
	; CHECK: br i1 %{{.}}, label %for.body, label %{{.}}, !prof !4			; CHECK: br i1 %{{.}}, label %[[NEXT3:.]], label %for.cond.for.end_crit_edge, !prof !4
				; CHECK: [[NEXT3]]:
				; CHECK: br i1 %{{.}}, label %for.body, label %{{.}}, !prof !5

	define void @basic(i32* %p, i32 %k) #0 !prof !0 {			define void @basic(i32* %p, i32 %k) #0 !prof !0 {
	entry:			entry:
	%cmp3 = icmp slt i32 0, %k			%cmp3 = icmp slt i32 0, %k
	br i1 %cmp3, label %for.body.lr.ph, label %for.end			br i1 %cmp3, label %for.body.lr.ph, label %for.end

	for.body.lr.ph: ; preds = %entry			for.body.lr.ph: ; preds = %entry
	br label %for.body			br label %for.body

	attributes #0 = { nounwind }			attributes #0 = { nounwind }
	attributes #1 = { nounwind optsize }			attributes #1 = { nounwind optsize }

	!0 = !{!"function_entry_count", i64 1}			!0 = !{!"function_entry_count", i64 1}
	!1 = !{!"branch_weights", i32 3001, i32 1001}			!1 = !{!"branch_weights", i32 3001, i32 1001}

	;CHECK: !1 = !{!"branch_weights", i32 900, i32 101}			;CHECK: !1 = !{!"branch_weights", i32 900, i32 101}
	;CHECK: !2 = !{!"branch_weights", i32 540, i32 360}			;CHECK: !2 = !{!"branch_weights", i32 607, i32 293}
	;CHECK: !3 = !{!"branch_weights", i32 162, i32 378}			;CHECK: !3 = !{!"branch_weights", i32 273, i32 334}
	;CHECK: !4 = !{!"branch_weights", i32 1399, i32 162}			;CHECK: !4 = !{!"branch_weights", i32 61, i32 212}
				;CHECK: !5 = !{!"branch_weights", i32 1160, i32 61}