This is an archive of the discontinued LLVM Phabricator instance.

Scale frequencies of a set of blocks
ClosedPublic

Authored by eraman on Jan 10 2017, 2:19 PM.

Details

Summary

Scaling is done with respect to a reference block whose new frequency is specified.
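In other words (as the discussion below makes explicit), each block B in the supplied set gets the updated frequency

    NewFreq(B) = Freq(B) * NewRefFreq / OldRefFreq

where OldRefFreq and NewRefFreq are the reference block's frequencies before and after the update (the symbol names here are illustrative, not taken from the patch).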

Diff Detail

Repository
rL LLVM

Event Timeline

eraman updated this revision to Diff 83871.Jan 10 2017, 2:19 PM
eraman retitled this revision from to Scale frequencies of a set of blocks.
eraman updated this object.
eraman added reviewers: davidxl, chandlerc.
eraman added a subscriber: llvm-commits.
davidxl accepted this revision.Jan 10 2017, 2:57 PM
davidxl edited edge metadata.

lgtm

This revision is now accepted and ready to land.Jan 10 2017, 2:57 PM
chandlerc added inline comments.Jan 10 2017, 3:03 PM
lib/Analysis/BlockFrequencyInfo.cpp
182 (On Diff #83871)

Extraneous blank line.

183 (On Diff #83871)

This does 2+N heap allocations where N is the number of basic blocks passed in.... Can we do something more efficient?

Can we use some of the other scaling infrastructure that is already used in BFI such as ScaledNumber?

At the very least, please don't declare a new APInt inside the loop.
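For illustration, the pattern being flagged looks roughly like the sketch below (a hypothetical helper with assumed names, not the code on the diff). An APInt wider than 64 bits keeps its words on the heap, so the two reference-frequency values account for the "2" and the per-iteration temporary for the "N":

    #include "llvm/ADT/APInt.h"
    #include "llvm/ADT/SmallPtrSet.h"
    #include "llvm/Analysis/BlockFrequencyInfo.h"
    #include "llvm/IR/BasicBlock.h"
    using namespace llvm;

    // Sketch only: every 128-bit APInt constructed here heap-allocates its words.
    static void scaleBlockFreqs(BlockFrequencyInfo &BFI,
                                const SmallPtrSetImpl<BasicBlock *> &Blocks,
                                uint64_t OldRefFreq, uint64_t NewRefFreq) {
      APInt NewFreq(128, NewRefFreq); // one allocation
      APInt OldFreq(128, OldRefFreq); // one allocation
      for (BasicBlock *BB : Blocks) {
        // A fresh 128-bit APInt per block: N more allocations.
        APInt BBFreq(128, BFI.getBlockFreq(BB).getFrequency());
        BBFreq *= NewFreq;
        BFI.setBlockFreq(BB, BBFreq.udiv(OldFreq).getLimitedValue());
      }
    }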

eraman updated this revision to Diff 83894.Jan 10 2017, 4:49 PM
eraman edited edge metadata.

Address Chandler's comments.

eraman marked an inline comment as done.Jan 10 2017, 4:53 PM
eraman added inline comments.
lib/Analysis/BlockFrequencyInfo.cpp
183 (On Diff #83871)

I can't think of a good way to efficiently do this with ScaledNumber.

I have moved the BBFreq declaration out of the loop. Now there is a call to the = operator inside the loop, which calls a memset. If the right set of optimizations kicks in, this memset should reduce to a store. Does this look reasonable?

Chandler, if you strongly prefer the use of ScaledNumber to APInt, I have a patch ready and will happily upload that. Even after this change, the multiply and divide operations allocate memory. Multiply seems similar in complexity (since we always multiply two 64-bit words). I don't know how Knuth's algorithm used in APInt compares with ScaledNumber's long division in the case where the dividend is more than 64 bits. The tradeoff here is the precision loss with ScaledNumber, whose effect in practice is unknown. I personally prefer to keep the APInt, but as I said above, I am ready to switch to ScaledNumber to get this patch in.
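Concretely, the hoisted form described above looks roughly like this (reusing the assumed names from the earlier sketch; not the actual Diff 83894):

    // Sketch only: BBFreq is constructed once; the operator=(uint64_t) inside
    // the loop reuses its heap storage and memsets the upper words. The
    // multiply and udiv may still allocate internally, as noted above.
    APInt NewFreq(128, NewRefFreq);
    APInt OldFreq(128, OldRefFreq);
    APInt BBFreq(128, 0);
    for (BasicBlock *BB : Blocks) {
      BBFreq = BFI.getBlockFreq(BB).getFrequency();
      BBFreq *= NewFreq;
      BFI.setBlockFreq(BB, BBFreq.udiv(OldFreq).getLimitedValue());
    }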

chandlerc accepted this revision.Jan 18 2017, 5:56 PM

> Chandler, if you strongly prefer the use of ScaledNumber to APInt, I have a patch ready and will happily upload that. Even after this change, the multiply and divide operations allocate memory. Multiply seems similar in complexity (since we always multiply two 64-bit words). I don't know how Knuth's algorithm used in APInt compares with ScaledNumber's long division in the case where the dividend is more than 64 bits. The tradeoff here is the precision loss with ScaledNumber, whose effect in practice is unknown. I personally prefer to keep the APInt, but as I said above, I am ready to switch to ScaledNumber to get this patch in.

Sorry I missed your earlier update.

This isn't about using ScaledNumber to *directly* implement the algorithm you have; it is about using a different algorithm. Specifically, if you look at how BFI computes things, it goes to great lengths to accumulate and normalize the frequency for each block without doing division. Instead, it uses shifts and other techniques to scale things. But this isn't really compatible with the style of update you are using here.

Now that the memory allocation is gone, the part I am concerned with is the 'udiv'. Knuth's algorithm is still frighteningly expensive. But we can wait and revisit this once it shows up in a profile. For the future maintainer of the code, please add a clear comment that this udiv is slow and that there are ways to remove it if it ever shows up.
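Such a comment might read roughly as follows (the wording is only an illustration, not text from the committed patch):

    // Note: this udiv goes through APInt's Knuth long division and is a known
    // potential hot spot. If it shows up in a profile, it can be removed, e.g.
    // by hoisting a precomputed scale out of the loop, reducing the operands
    // to 32 bits, or reusing BFI's shift-based normalization (see the review
    // discussion of this revision).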

I see three fundamental options that would eliminate this. Two are simpler but lose precision more rapidly; the other is more complex but retains precision:

  1. Factor the division out of the loop into a pre-computed scale, and shift that scale and the input frequencies to maximize the precision that remains.
  2. Reduce the precision a priori to 32 bits so that a 64-bit multiply and divide suffice, and then re-scale to the 64-bit values.
  3. Recast this using the core BFI algorithm for normalizing block mass after scaling due to backedges (or inlining in this case). There might be a way to do this locally, I'm not sure. This would essentially mean adding live updates to the core BFI algorithm used.

#3 seems like it will have the best results but will be *very* hard to implement. Both #1 and #2 seem possible, but I think it's fine to wait and do that in a follow-up patch. I particularly like #1 because it isn't just hiding the cost in the hardware divide instruction, which, for example, won't exist on 32-bit ARM hosts.
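As a rough illustration of option #1 (a sketch only, not the committed code), the udiv can be hoisted out of the loop into a fixed-point scale. For brevity the sketch assumes the frequencies fit in 32 bits and that NewRefFreq <= OldRefFreq, so the 32.32 scale and the per-block product fit in 64 bits; real code would derive the shift amounts from the actual operand widths, which is the "shift to maximize the precision" part:

    #include <cstdint>

    // Option 1, sketched: the udiv is done once, outside the loop, as a 32.32
    // fixed-point scale; each block then needs only a multiply and a shift.
    // Assumes Freqs[i] and NewRefFreq fit in 32 bits and that
    // NewRefFreq <= OldRefFreq, so nothing overflows 64 bits.
    void scaleFreqsApprox(uint64_t *Freqs, unsigned NumBlocks,
                          uint32_t NewRefFreq, uint32_t OldRefFreq) {
      uint64_t Scale = ((uint64_t)NewRefFreq << 32) / OldRefFreq; // one udiv
      for (unsigned I = 0; I != NumBlocks; ++I)
        Freqs[I] = (Freqs[I] * Scale) >> 32; // multiply + shift per block
    }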

Anyways, feel free to submit now that the memory allocation is handled separately and with a comment that the udiv is a known likely hot spot.

This revision was automatically updated to reflect the committed changes.