This is an archive of the discontinued LLVM Phabricator instance.

[llvm-profgen] Skip duplication factor outside of body sample computation
ClosedPublic

Authored by wenlei on Oct 18 2021, 5:46 PM.

Details

Summary

We incorrectly apply the duplication factor to total samples even though we already accumulate samples instead of taking the MAX. This causes the profile to have bloated total samples for functions with unrolled or vectorized loops. This change fixes the issue for total samples, head samples, and call target samples.
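For context, a minimal sketch of the intended accounting (hypothetical names and Python pseudocode; the actual llvm-profgen logic is C++):

```python
# Sketch of the fix, with hypothetical names. The duplication factor
# compensates for compiler transformations (loop unrolling,
# vectorization) that replicate an instruction, so it belongs in the
# per-line *body* sample computation. Total samples are accumulated
# across all instructions already, so scaling them by the factor
# double-counts.

def accumulate(records):
    """records: list of (sample_count, duplication_factor) pairs."""
    body_samples = 0
    total_samples = 0
    for count, dup_factor in records:
        body_samples += count * dup_factor  # factor applies here
        total_samples += count              # ...but not here (the fix)
    return body_samples, total_samples

# Two samples attributed to a line inside a 4x-unrolled loop:
body, total = accumulate([(10, 4), (5, 4)])
# body == 60 (scaled per-line count), total == 15 (plain accumulation);
# before the fix, total would also have been a bloated 60.
```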

Diff Detail

Event Timeline

wenlei created this revision.Oct 18 2021, 5:46 PM
wenlei requested review of this revision.Oct 18 2021, 5:46 PM
Herald added a project: Restricted Project.Oct 18 2021, 5:46 PM

@rajeshwarv, checking google's tool, it looks like the same issue exists there too - the total sample count could be bloated due to the duplication factor. Wondering if the code linked below reflects the latest version used internally, or if there's any special reason for such a computation? cc @davidxl

https://github.com/google/autofdo/blob/master/symbol_map.cc#L508

https://github.com/google/autofdo/blob/master/symbol_map.cc#L541-L544

davidxl added a subscriber: xur.Oct 18 2021, 9:31 PM

Probably not an issue with FSAFDO discriminators?

hoy accepted this revision.Oct 19 2021, 9:08 AM

Thanks for the fix, lgtm.

This revision is now accepted and ready to land.Oct 19 2021, 9:08 AM

Probably not an issue with FSAFDO discriminators?

I think so. It only affects AutoFDO using the duplication factor, which I assume is still in use at Google internally?

wenlei added subscribers: hoy, wlei.Oct 19 2021, 10:00 AM

Good to know. Just to clarify, you meant that the removal of duplication factor doesn’t show perf impact, right? You didn’t actually try to see if fixing the bloated total samples (as proposed in the patch) has perf impact?

Asking because I think the fix here would affect sample loader inlining for AutoFDO, since it looks at total samples for inline decisions. It may still not have perf impact, depending on workloads. For our HHVM workload, one cold function was incorrectly counted among the top 5 hot functions in the profile due to this bug. But I’m surprised that this bug survived in the tools used by both of us through the years.. 😊

Thanks,
Wenlei

From: Rong Xu <xur@google.com>
Date: Tuesday, October 19, 2021 at 9:44 AM
To: Wenlei He <reviews+D112042+public+c17afeb67dc591e4@reviews.llvm.org>
Cc: Wenlei He <wenlei@fb.com>, Hongtao Yu <hoy@fb.com>, Lei Wang <wlei@fb.com>, davidxl@google.com <davidxl@google.com>, rajeshwarv@google.com <rajeshwarv@google.com>, llvm-commits@lists.llvm.org <llvm-commits@lists.llvm.org>, lxfind@gmail.com <lxfind@gmail.com>, Modi Mo <modimo@fb.com>, bhuvanendra.kumarn@amd.com <bhuvanendra.kumarn@amd.com>, yanliang.mu@intel.com <yanliang.mu@intel.com>, dougpuob@gmail.com <dougpuob@gmail.com>, michael.hliao@gmail.com <michael.hliao@gmail.com>, florian_hahn@apple.com <florian_hahn@apple.com>, david.green@arm.com <david.green@arm.com>, simon.moll@emea.nec.com <simon.moll@emea.nec.com>
Subject: Re: [PATCH] D112042: [llvm-profgen] Skip duplication factor outside of body sample computation

Your create_llvm_prof tool also has the same bug for AutoFDO as I commented earlier.

https://github.com/google/autofdo/blob/master/symbol_map.cc#L508

https://github.com/google/autofdo/blob/master/symbol_map.cc#L541-L544

Thanks,
Wenlei

From: Rong Xu <xur@google.com>
Date: Tuesday, October 19, 2021 at 11:23 AM
To: Wenlei He <wenlei@fb.com>
Cc: Wenlei He <reviews+D112042+public+c17afeb67dc591e4@reviews.llvm.org>, Hongtao Yu <hoy@fb.com>, Lei Wang <wlei@fb.com>, davidxl@google.com <davidxl@google.com>, rajeshwarv@google.com <rajeshwarv@google.com>, lxfind@gmail.com <lxfind@gmail.com>, Modi Mo <modimo@fb.com>
Subject: Re: [PATCH] D112042: [llvm-profgen] Skip duplication factor outside of body sample computation

xur added a subscriber: wenlei.Oct 19 2021, 2:03 PM

Sorry. I missed that part.

The total count straight from create_llvm_prof is always garbage.
The reason for this is that create_llvm_prof does not disassemble the instruction. It just uses address range to accumulate the counters.
I found this when I worked on FSAFDO. If I want to get good function total counts, I always do another merge of the profile (profile_merger) -- this way the entry count will be recomputed and it's the sum of the counter inside.
I had some discussions with Wei about this. But we think the function's total count is not really used in the compiler.
I did some experiments and did not find a performance difference with fixed function total counts.

Our production build uses another pipeline (it has a disassembler) and does not suffer this issue.

Overcounting is unavoidable here since there is no basic block information (all the instructions with different offsets will have a count).

-Rong

The total count straight from create_llvm_prof is always garbage.

I didn’t realize that until now..

The reason for this is that create_llvm_prof does not disassemble the instruction. It just uses address range to accumulate the counters.

Yes, that’s one source of overcounting in create_llvm_prof, but the issue here is different: it’s simply a bug in how the duplication factor is used. Also note that llvm-profgen does disassemble instructions and calculates the total count properly.
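To illustrate the difference (hypothetical numbers, not the actual create_llvm_prof code): without disassembly, a tool can only attribute an address range's count to every sampled offset inside it, so the per-function total grows with the number of sampled offsets rather than with actual execution.

```python
# Sketch with made-up numbers: a straight-line region executed 100
# times, sampled at four byte offsets. Range-based accumulation adds
# the count once per offset, inflating the total; a disassembling tool
# like llvm-profgen can attribute counts per instruction instead.

range_count = 100
sampled_offsets = [0x10, 0x12, 0x16, 0x18]

range_based_total = sum(range_count for _ in sampled_offsets)
# range_based_total == 400, for a region that only ran 100 times.
```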

I had some discussions with Wei about this. But we think the function's total count is not really used in the compiler.

Well.. the sample loader does look at the total count for inline decisions. See the code links below.

https://github.com/llvm/llvm-project/blob/main/llvm/lib/Transforms/IPO/SampleProfile.cpp#L1082
->
https://github.com/llvm/llvm-project/blob/main/llvm/lib/Transforms/Utils/SampleProfileLoaderBaseUtil.cpp#L62
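A rough sketch of why bloated totals matter here (hypothetical threshold logic, only loosely modeled on the sample loader's use of total samples; not the actual LLVM code):

```python
# Hypothetical hotness check: a callee whose total samples exceed some
# fraction of the program's total samples is treated as hot and
# considered for inlining. A bug that multiplies totals by the
# duplication factor can push a genuinely cold function over the
# threshold.

def is_hot(callee_total_samples, program_total_samples,
           hot_fraction=0.0005):
    return callee_total_samples >= program_total_samples * hot_fraction

program_total = 1_000_000
real_samples = 100          # genuinely cold function
bloated = real_samples * 8  # e.g. inflated by an 8x duplication factor

# With real counts the function is correctly cold; with the bloated
# total it crosses the (hypothetical) hotness threshold.
```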

I found this when I worked on FSAFDO. If I want to get good function total counts, I always do another merge of the profile (profile_merger) -- this way the entry count will be recomputed and it's the sum of the counter inside.

Is profile_merger a separate tool you use internally? I wasn’t aware of it..

Our production build uses another pipeline (it has a disassembler) and does not suffer this issue.

Okay, good to know. I thought create_llvm_prof was used in your production pipeline..

Thanks,
Wenlei


I'm late to this discussion (sorry about that), but I can confirm that this shouldn't impact our production. We don't use this tool/code in our workflows, but thank you for pointing it out.

The reason for this is that create_llvm_prof does not disassemble the instruction. It just uses address range to accumulate the counters.

Should we fix it in create_llvm_prof as a post-pass in the tool? I see this creates a discrepancy between our production pipeline and this power-user case (create_llvm_prof), albeit one that is benign as of today.


Actually, this shows the benefit of having a unified tool: one fix in one place can take care of all users. llvm-profgen is now almost ready for AutoFDO too (it's fully ready for CSSPGO, and FS-AutoFDO will be supported soon), so how about we all focus our effort on llvm-profgen? Would it be reasonable for you to also use llvm-profgen for internal, non-prod, power-user cases?

We plan to make llvm-profgen part of our production pipeline soon, replacing an internal tool, and also to use it for power-user cases for AutoFDO, CSSPGO, and FS-AutoFDO in the future.