This is an archive of the discontinued LLVM Phabricator instance.


Differential D100993

[CSSPGO] Fix incorrect prorating indirect call distribution factor that leads to target count loss.
Closed, Public

Authored by hoy on Apr 21 2021, 1:09 PM.

Details

Reviewers

wenlei
wlei
wmi
davidxl

Commits

rG5f2d7300733b: [CSSPGO] Fix incorrect prorating indirect call distribution factor that leads…

Summary

The pseudo probe distribution factor is used to scale down profile samples so that the usage of "maximum" in getBlockWeight does not mislead counts inference. For callsites, the scaling down can come from code duplication prior to the sample profile loader (prelink or postlink), or from indirect call promotion in the sample loader inliner. This patch fixes an issue in sample loader ICP where the leftover indirect callsite scaling down unexpectedly causes the loss of non-promoted call target samples. While the scaling down is meant to favor BFI/BPI with an accurate callsite count, it doesn't fit the current distribution factor, which represents code duplication changes. Ideally, we would need two factors, one for code duplication and the other for ICP. However, this seems overcomplicated. I'm going to trade one usage (callsite counts) for the other (call target counts).

Seeing perf win on one benchmark (mcf) of SPEC2017 with others unchanged.
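
As a rough illustration of the count loss described in the summary, the sketch below reuses a single factor for both purposes. It is not the code in SampleProfile.cpp: `remainingFactor`, `scaleCount` and `singleFactorForBothUses` are hypothetical helpers, the counts are made up, and truncating toward zero is an assumption of the sketch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: fraction of the original callsite count that is left
// on the indirect call once the promoted targets' counts are subtracted.
inline double remainingFactor(uint64_t RemainingSum, uint64_t OriginalSum) {
  assert(OriginalSum != 0 && RemainingSum <= OriginalSum);
  return static_cast<double>(RemainingSum) / static_cast<double>(OriginalSum);
}

// Hypothetical helper: scale a sample count by a distribution factor.
// Truncation toward zero is an assumption of this sketch.
inline uint64_t scaleCount(uint64_t Count, double Factor) {
  return static_cast<uint64_t>(static_cast<double>(Count) * Factor);
}

// Made-up profile: the indirect call has two targets, foo (750) and bar (250).
// ICP promotes foo, so only bar's 250 samples remain on the indirect call.
struct Example {
  uint64_t CallsiteCount;  // count seen by block weight inference
  uint64_t BarTargetCount; // count stored for the non-promoted target
};

inline Example singleFactorForBothUses() {
  const uint64_t OriginalSum = 750 + 250;
  const uint64_t RemainingSum = 250;
  const double Factor = remainingFactor(RemainingSum, OriginalSum);

  Example E;
  // Use 1: the callsite count shrinks to the remaining samples (intended).
  E.CallsiteCount = scaleCount(OriginalSum, Factor);
  // Use 2: the same factor applied to bar's own count shrinks it as well,
  // although none of bar's samples moved to the promoted call.
  E.BarTargetCount = scaleCount(250, Factor);
  return E;
}
```

In this made-up case the callsite count comes out as 250, which matches what is left on the indirect call, while bar's target count drops from 250 to 62. The second effect is the unexpected loss of non-promoted call target samples.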

Diff Detail

Repository: rG LLVM Github Monorepo

Unit Tests: Failed

Time	Test
0 ms	x64 debian > libomptarget.mapping::declare_mapper_nested_default_mappers_array.cpp

Event Timeline

hoy created this revision. Apr 21 2021, 1:09 PM

Herald added subscribers: wenlei, hiraditya. Apr 21 2021, 1:09 PM

hoy requested review of this revision. Apr 21 2021, 1:09 PM

Herald added a project: Restricted Project. Apr 21 2021, 1:09 PM

Herald added a subscriber: llvm-commits.

Updating D100993: [CSSPGO] Fix incorrect prorating indirect call distribution factor that leads to target count loss.

hoy edited the summary of this revision. Apr 21 2021, 1:39 PM

hoy added reviewers: wenlei, wlei, wmi, davidxl.

Harbormaster completed remote builds in B100081: Diff 339361. Apr 21 2021, 2:17 PM

wenlei added inline comments. Apr 21 2021, 2:38 PM

llvm/lib/Transforms/IPO/SampleProfile.cpp
835	This makes sure that the count for the indirect call instruction accurately reflects the counts for remaining indirect call. However I thought later PGO ICP pass relies on the value profiles we put onto metadata, so even if the count for that icall instruction is a bit off, that ICP should not be affected. Is that correct?
Is the problem here more about pre-link sample loader ICP screwing up the distribution factor, so that post-link sample loader ICP sees a smaller count for the icall instruction, which blocks the post-link sample loader ICP? In this case, if we look at call site target counts from the profile, with promoted targets excluded, we would arrive at the correct sum even without the distribution factor? Relying on distribution could get us the same result, but using prorating to mimic subtraction is a bit weird.
I can see how an accurate distribution factor helps count quality in general though: it would make counts for icall more consistent before running inference. On that, how do we make sure promoted calls get correct counts for inference input though?
845	nit:
SumOrigin -> Original sum of target counts for indirect call before promoting given candidate.
Sum -> Sum of remaining target counts for indirect call after promoting given candidate.
There's no prorated sum; the Sum is adjusted using subtraction: `Sum -= Candidate.CallsiteCount;`
llvm/test/Transforms/SampleProfile/pseudo-probe-icp-factor.ll
261	What was the value before this fix? Is that much smaller than 0.95?

Harbormaster completed remote builds in B100087: Diff 339369. Apr 21 2021, 2:46 PM

hoy added inline comments. Apr 21 2021, 3:09 PM

llvm/lib/Transforms/IPO/SampleProfile.cpp
835	Yes, this is supposed to make sure that the indirect call and its enclosing block will have a correct count during sample annotation, but looks like doing it here is late. I just realized this is after `computeAndPropagateWeights`. Ideally, the distribution factor should be updated before `computeAndPropagateWeights` and the call target samples should be adjusted with the pre-updated factor. That means `updateIDTMetaData` should be done before `computeAndPropagateWeights`. The prelink problem doesn't need this to be fixed. We have the pseudo probe update pass that updates distribution factor based on BFI to handle that.
845	`Sum` is actually passed in as prorated. `Candidate.CallsiteCount` is also prorated. Maybe "Prorated sum of remaining target counts for indirect call after promoting given candidate"?
llvm/test/Transforms/SampleProfile/pseudo-probe-icp-factor.ll
261	It was 0.04 * 11259 = 450, where 0.04 was the distribution factor fixed up earlier in inlining time (line 255).

wenlei added inline comments. Apr 21 2021, 4:25 PM

llvm/lib/Transforms/IPO/SampleProfile.cpp
835	> Ideally, the distribution factor should be updated before computeAndPropagateWeights and the call target samples should be adjusted with the pre-updated factor.
Agreed, that ensures good input for inference.
> That means updateIDTMetaData should be done before computeAndPropagateWeights.
I think that update is meant for value profile metadata, which is orthogonal to computeAndPropagateWeights. In fact, we always set metadata after inference is done. Is it possible to decouple the distribution factor setting from metadata update?
> The prelink problem doesn't need this to be fixed. We have the pseudo probe update pass that updates distribution factor based on BFI to handle that.
Ok, then does this cause immediate problem for ICP (does the fix generate good result for xalancbmk without setting icp threshold to 2)?
> On that, how do we make sure promoted calls get correct counts for inference input though?
Taking another look, line 891 of sampleprofile.cpp took care of this?
845	Ah, I see. You're right, sorry for the confusion.
llvm/test/Transforms/SampleProfile/pseudo-probe-icp-factor.ll
261	So with this fix, the distribution factor we see after sample loader changed from 0.04 to 0.95, correct?

hoy added inline comments. Apr 21 2021, 6:12 PM

llvm/lib/Transforms/IPO/SampleProfile.cpp
835	> Is it possible to decouple the distribution factor setting from metadata update?
They are coupled in this patch because the target counts from profile need to be prorated based on a pre-updated factor (say from prelink or inlining, but not ICP). The ICP-updated factor is the final one and is needed before `computeAndPropagateWeights`. Decoupling the ICP factor update from metadata update means we need to save the original factor somewhere, which can be possibly done using an intermediate table. On the other hand, the indirect call metadata update may not rely on `computeAndPropagateWeights`, unlike the branch metadata update. But separating them seems a bigger change and less understandable.
> Ok, then does this cause immediate problem for ICP (does the fix generate good result for xalancbmk without setting icp threshold to 2)?
It causes less promotion with later PGO ICP since the early prorate eats the target count for non-promoted targets. I'm seeing this helps some benchmarks like perlbench and gcc, but regresses other benchmarks like h264. I guess that's due to the heuristics of PGO ICP. Overall no change in geomean.
> Taking another look, line 891 of sampleprofile.cpp took care of this?
Yes.
llvm/test/Transforms/SampleProfile/pseudo-probe-icp-factor.ll
261	The distribution factor is still 0.04, but the call target count for bar is 8444 instead of 450.

wenlei added inline comments. Apr 22 2021, 12:42 AM

llvm/lib/Transforms/IPO/SampleProfile.cpp
835	> Decoupling the ICP factor update from metadata update means we need to save the original factor somewhere, which can be possibly done using an intermediate table.
My understanding: final_factor = original_factor * (remaining_counts / total_counts), and the original factor is simply Probe->Factor. Why do we need to save the original factor? Did I miss anything?
> On the other hand, the indirect call metadata update may not rely on computeAndPropagateWeights, unlike the branch metadata update. But separating them seems a bigger change and less understandable.
If icall metadata update doesn't rely on inference, it's ok to set icall metadata before inference. But that "breaks" the high level structure of inference first, followed by metadata update to persist the result onto metadata, hence a bit inconsistent.
> It causes less promotion with later PGO ICP since the early prorate eats the target count for non-promoted targets.
Hmm.. this goes back to my original question in earlier comments. "However I thought later PGO ICP pass relies on the value profiles we put onto metadata, so even if the count for that icall instruction is a bit off, that ICP should not be affected. Is that correct?" I now see this `T = SampleRecord::adjustCallTargets(T.get(), Probe->Factor);` which explains why a smaller factor can screw up target counts and value profile metadata, in addition to block count.
This part still looks fishy. It's reasonable to scale down all target counts when call site is duplicated. But if the factor is also used to get to correct remaining count for icall itself, scaling down all target count using such factor is not right. I understand you now set factor later to avoid scaling down targets in the same pass, but previous pass can still set a factor even when there's no call site duplication.
As I understand, in the ICP case, the original motivation for the distribution factor is simply to account for call site duplication. If we stick to that, scaling down all targets is fine. Problems arise when we use the distribution factor to mimic the effect of subtraction (of promoted target counts). Now the fix works around the issue by setting the distribution factor late, but I still feel using the distribution factor to mimic subtraction is a bit hacky, and gives the distribution factor an inconsistent meaning (which is not compatible when used to scale down all targets). I hope we can have a better solution, let's discuss tomorrow.

hoy added inline comments. Apr 22 2021, 9:31 AM

llvm/lib/Transforms/IPO/SampleProfile.cpp
835	> My understanding: final_factor = original_factor * (remaining_counts / total_counts), and the original factor is simply Probe->Factor. Why do we need to save the original factor? Did I miss anything?
The original factor, instead of the final factor, is used to scale down target counts, i.e., in `SampleRecord::adjustCallTargets`. This is to make sure later PGO ICP sees the right target count due to code duplication. The final factor is used to scale down the indirect call site counts, which are used in `computeAndPropagateWeights`. Since `adjustCallTargets` currently runs after `computeAndPropagateWeights`, `original_factor` needs to be saved somewhere.
> Hmm.. this goes back to my original question in earlier comments. "However I thought later PGO ICP pass relies on the value profiles we put onto metadata, so even if the count for that icall instruction is a bit off, that ICP should not be affected. Is that correct?"
Right, callsite counts only affect BFI, but not PGO ICP. `SampleRecord::adjustCallTargets` affects ICP, and they use different factors, i.e., final factor and original factor, respectively.
> This part still looks fishy. It's reasonable to scale down all target counts when call site is duplicated. But if the factor is also used to get to correct remaining count for icall itself, scaling down all target count using such factor is not right. I understand you now set factor later to avoid scaling down targets in the same pass, but previous pass can still set a factor even when there's no call site duplication.
It was wrong because we used the final factor to scale down all target counts. Now we are using the original factor, which could be from code duplication or prelink sample ICP. When it is from prelink sample ICP and we are at postlink, the promoted targets being scaled down here (which are wrong) will be ignored, since they are known to be promoted from the magic number NOMORE_ICP_MAGICNUM in the callsite metadata.
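
The split between the original and the final factor described in this reply can be sketched as follows. This is only an illustration under assumptions: `ProbeFactors`, `afterPromotion` and `scale` are hypothetical names that do not exist in SampleProfile.cpp, the numbers are made up, and truncation toward zero is assumed.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical pair of factors kept for one indirect callsite probe.
struct ProbeFactors {
  double Original; // factor inherited from code duplication or an earlier pass
  double Final;    // Original further reduced by the share taken by ICP
};

// final_factor = original_factor * (remaining_counts / total_counts)
inline ProbeFactors afterPromotion(double Original, uint64_t RemainingCounts,
                                   uint64_t TotalCounts) {
  assert(TotalCounts != 0 && RemainingCounts <= TotalCounts);
  const double Share =
      static_cast<double>(RemainingCounts) / static_cast<double>(TotalCounts);
  return ProbeFactors{Original, Original * Share};
}

// Hypothetical helper: scale a sample count by a factor, truncating.
inline uint64_t scale(uint64_t Count, double Factor) {
  return static_cast<uint64_t>(static_cast<double>(Count) * Factor);
}
```

With an original factor of 0.5 and 250 of 1000 samples left after promotion, the final factor is 0.125. Scaling the callsite count of 1000 by the final factor gives 125, and scaling the remaining target's 250 samples by the original factor also gives 125, so the callsite count and the remaining target count stay consistent.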

Saving the original factor and using it to scale down target counts.

hoy added inline comments. Apr 22 2021, 9:38 AM

llvm/lib/Transforms/IPO/SampleProfile.cpp
835	Hmm, there seems still a problem with this approach. If the original factor comes from prelink sample ICP, and using it to scale down target counts in postlink doesn't make sense. Looks like we need two factors, one is for code duplication, the other is for ICP.

wenlei added inline comments. Apr 22 2021, 10:13 AM

llvm/lib/Transforms/IPO/SampleProfile.cpp
835	> Hmm, there seems still a problem with this approach. If the original factor comes from prelink sample ICP, and using it to scale down target counts in postlink doesn't make sense. Looks like we need two factors, one is for code duplication, the other is for ICP.
Yeah, this is exactly what I meant in my last reply. Fundamentally the problem is trying to use the factor which is supposed to handle duplication to also adjust remaining total counts for icall after ICP. The two usages are incompatible; moving the order only works around the issue for things that happen in the same pass. Two factors would work, but add complexity. How does AutoFDO handle this?

Harbormaster completed remote builds in B100308: Diff 339683. Apr 22 2021, 11:50 AM

hoy added inline comments. Apr 23 2021, 9:02 AM

llvm/lib/Transforms/IPO/SampleProfile.cpp
835	Discussed offline. Two factors seem overcomplicated. Going to trade one usage (callsite counts) for the other (call target counts). Seeing perf win on one benchmark (mcf) of SPEC2017 with others unchanged.

Updating D100993: [CSSPGO] Fix incorrect prorating indirect call distribution factor that leads to target count loss.

wenlei added inline comments. Apr 23 2021, 9:12 AM

llvm/lib/Transforms/IPO/SampleProfile.cpp
468	This map is no longer needed?

hoy added inline comments. Apr 23 2021, 9:24 AM

llvm/lib/Transforms/IPO/SampleProfile.cpp
468	Oops, forgot to remove it.

Updating D100993: [CSSPGO] Fix incorrect prorating indirect call distribution factor that leads to target count loss.

lgtm, thanks for addressing the ICP issues.

This revision is now accepted and ready to land. Apr 23 2021, 9:27 AM

hoy edited the summary of this revision. Apr 23 2021, 9:34 AM

This revision was landed with ongoing or failed builds. Apr 23 2021, 11:09 AM

Closed by commit rG5f2d7300733b: [CSSPGO] Fix incorrect prorating indirect call distribution factor that leads… (authored by hoy).

This revision was automatically updated to reflect the committed changes.

hoy added a commit: rG5f2d7300733b: [CSSPGO] Fix incorrect prorating indirect call distribution factor that leads….

Harbormaster completed remote builds in B100592: Diff 340062. Apr 23 2021, 11:23 AM

Harbormaster completed remote builds in B100604: Diff 340078. Apr 23 2021, 12:04 PM

wenlei mentioned this in D102537: [CSSPGO] Overwrite branch weight annotated in previous pass. May 15 2021, 6:21 PM

Revision Contents

Path	Size
llvm/lib/Transforms/IPO/SampleProfile.cpp	15 lines
llvm/test/Transforms/SampleProfile/pseudo-probe-icp-factor.ll	11 lines

Diff 339369

llvm/lib/Transforms/IPO/SampleProfile.cpp

[459 lines not shown; context: protected:]
   // External inline advisor used to replay inline decision from remarks.
   std::unique_ptr<ReplayInlineAdvisor> ExternalInlineAdvisor;

   // A pseudo probe helper to correlate the imported sample counts.
   std::unique_ptr<PseudoProbeManager> ProbeManager;
 };

 class SampleProfileLoaderLegacyPass : public ModulePass {
 public:
   // Class identification, replacement for typeinfo
   static char ID;

   SampleProfileLoaderLegacyPass(
       StringRef Name = SampleProfileFile,
       ThinOrFullLTOPhase LTOPhase = ThinOrFullLTOPhase::None)
       : ModulePass(ID), SampleLoader(
                             Name, SampleProfileRemappingFile, LTOPhase,
[278 lines not shown]
 /// we expect to use it to update the total count in the value profile.
 static void
 updateIDTMetaData(Instruction &Inst,
                   const SmallVectorImpl<InstrProfValueData> &CallTargets,
                   uint64_t Sum) {
   uint32_t NumVals = 0;
   // OldSum is the existing total count in the value profile data.
   uint64_t OldSum = 0;
+  // FullSum is the sum of counts of all the targets, including promoted
+  // targets.
+  uint64_t FullSum = Sum;
   std::unique_ptr<InstrProfValueData[]> ValueData =
       std::make_unique<InstrProfValueData[]>(MaxNumPromotions);
   bool Valid =
       getValueProfDataFromInst(Inst, IPVK_IndirectCallTarget, MaxNumPromotions,
                                ValueData.get(), NumVals, OldSum, true);

   DenseMap<uint64_t, uint64_t> ValueCountMap;
   if (Sum == 0) {
[49 lines not shown; context: llvm::sort(NewCallTargets,]
                 return L.Count > R.Count;
               return L.Value > R.Value;
             });

   uint32_t MaxMDCount =
       std::min(NewCallTargets.size(), static_cast<size_t>(MaxNumPromotions));
   annotateValueSite(*Inst.getParent()->getParent()->getParent(), Inst,
                     NewCallTargets, Sum, IPVK_IndirectCallTarget, MaxMDCount);

+  // Prorate the indirect callsite distribution.
+  // FullSum is the sum including promoted targets, while sum is the sum
+  // excluding promoted targets.
+  if (FullSum)
+    if (Optional<PseudoProbe> Probe = extractProbe(Inst))
+      setProbeDistributionFactor(Inst, (Probe->Factor * Sum) / FullSum);
 }

 /// Attempt to promote indirect call and also inline the promoted call.
 ///
 /// \param F Caller function.
 /// \param Candidate ICP and inline candidate.
-/// \param Sum Sum of target counts for indirect call.
+/// \param SumOrigin Original sum of target counts for indirect call.
+/// \param Sum Prorated sum of target counts for indirect call.
 /// \param InlinedCallSite Output vector for new call sites exposed after
 /// inlining.
 bool SampleProfileLoader::tryPromoteAndInlineCandidate(
     Function &F, InlineCandidate &Candidate, uint64_t SumOrigin, uint64_t &Sum,
     SmallVector<CallBase *, 8> *InlinedCallSite) {
   auto CalleeFunctionName = Candidate.CalleeSamples->getFuncName();
   auto R = SymbolMap.find(CalleeFunctionName);
   if (R == SymbolMap.end() || !R->getValue())
[18 lines not shown; context: if (!R->getValue()->isDeclaration() && R->getValue()->getSubprogram() &&]
     SmallVector<InstrProfValueData, 1> SortedCallTargets = {InstrProfValueData{
         Function::getGUID(R->getValue()->getName()), NOMORE_ICP_MAGICNUM}};
     updateIDTMetaData(CI, SortedCallTargets, 0);

     auto *DI = &pgo::promoteIndirectCall(
         CI, R->getValue(), Candidate.CallsiteCount, Sum, false, ORE);
     if (DI) {
       Sum -= Candidate.CallsiteCount;
-      // Prorate the indirect callsite distribution.
       // Do not update the promoted direct callsite distribution at this
       // point since the original distribution combined with the callee
       // profile will be used to prorate callsites from the callee if
       // inlined. Once not inlined, the direct callsite distribution should
       // be prorated so that the it will reflect the real callsite counts.
-      setProbeDistributionFactor(CI, static_cast<float>(Sum) / SumOrigin);
       Candidate.CallInstr = DI;
if (isa<CallInst>(DI) \|\| isa<InvokeInst>(DI)) {		if (isa<CallInst>(DI) \|\| isa<InvokeInst>(DI)) {
bool Inlined = tryInlineCandidate(Candidate, InlinedCallSite);		bool Inlined = tryInlineCandidate(Candidate, InlinedCallSite);
if (!Inlined) {		if (!Inlined) {
// Prorate the direct callsite distribution so that it reflects real		// Prorate the direct callsite distribution so that it reflects real
// callsite counts.		// callsite counts.
setProbeDistributionFactor(		setProbeDistributionFactor(
*DI, static_cast<float>(Candidate.CallsiteCount) / SumOrigin);		*DI, static_cast<float>(Candidate.CallsiteCount) / SumOrigin);
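
The bookkeeping in this hunk can be sketched in isolation. The snippet below is a minimal illustration, not the LLVM code: `PromotionResult` and `promoteOne` are hypothetical names, and plain integers stand in for the profile types. It mirrors `Sum -= Candidate.CallsiteCount` and the `CallsiteCount / SumOrigin` factor applied to a promoted direct call that is not inlined. It also computes the `Sum / SumOrigin` value that, as the diff reads, appears only on the old side and was used to prorate the leftover indirect callsite.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the outcome of one promotion step.
struct PromotionResult {
  uint64_t RemainingSum;   // Sum after subtracting the promoted target's count.
  float DirectCallFactor;  // Factor for the promoted direct call when not inlined.
  float OldIndirectFactor; // Factor the old-side line would have put on the indirect call.
};

// Mirrors the arithmetic in the hunk above on plain integers.
PromotionResult promoteOne(uint64_t SumOrigin, uint64_t Sum,
                           uint64_t CallsiteCount) {
  assert(SumOrigin != 0 && CallsiteCount <= Sum && Sum <= SumOrigin);
  uint64_t Remaining = Sum - CallsiteCount;
  return {Remaining,
          static_cast<float>(CallsiteCount) / SumOrigin,
          static_cast<float>(Remaining) / SumOrigin};
}
```

With `SumOrigin = 100`, `Sum = 100` and a promoted target count of 60, the remaining sum is 40 and the direct call factor is 0.6. The old-side indirect factor would have been 0.4, which is the scaling the summary describes as causing non-promoted targets to lose samples.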

llvm/test/Transforms/SampleProfile/pseudo-probe-icp-factor.ll

	!43 = !DILocalVariable(name: "f", arg: 2, scope: !35, file: !1, line: 13, type: !38)			!43 = !DILocalVariable(name: "f", arg: 2, scope: !35, file: !1, line: 13, type: !38)
	!44 = !DILocation(line: 13, column: 13, scope: !35)			!44 = !DILocation(line: 13, column: 13, scope: !35)
	!45 = !{!46, !46, i64 0}			!45 = !{!46, !46, i64 0}
	!46 = !{!"any pointer", !21, i64 0}			!46 = !{!"any pointer", !21, i64 0}
	!47 = !DILocation(line: 13, column: 22, scope: !35)			!47 = !DILocation(line: 13, column: 22, scope: !35)
	!48 = !DILocation(line: 14, column: 10, scope: !35)			!48 = !DILocation(line: 14, column: 10, scope: !35)
	!49 = !DILocation(line: 14, column: 12, scope: !35)			!49 = !DILocation(line: 14, column: 12, scope: !35)
	!50 = !DILocation(line: 14, column: 10, scope: !51)			!50 = !DILocation(line: 14, column: 10, scope: !51)
				;; A discriminator of 108527639 which is 0x6780017 in hexdecimal, stands for an indirect call probe
				;; with an index of 2 and probe factor of 0.79.
	!51 = !DILexicalBlockFile(scope: !35, file: !1, discriminator: 108527639)			!51 = !DILexicalBlockFile(scope: !35, file: !1, discriminator: 108527639)
	!52 = !DILocation(line: 14, column: 3, scope: !35)			!52 = !DILocation(line: 14, column: 3, scope: !35)
	!53 = distinct !DISubprogram(name: "main", scope: !1, file: !1, line: 17, type: !54, scopeLine: 18, flags: DIFlagAllCallsDescribed, spFlags: DISPFlagDefinition \| DISPFlagOptimized, unit: !0, retainedNodes: !56)			!53 = distinct !DISubprogram(name: "main", scope: !1, file: !1, line: 17, type: !54, scopeLine: 18, flags: DIFlagAllCallsDescribed, spFlags: DISPFlagDefinition \| DISPFlagOptimized, unit: !0, retainedNodes: !56)
	!54 = !DISubroutineType(types: !55)			!54 = !DISubroutineType(types: !55)
	!55 = !{!16}			!55 = !{!16}
	!56 = !{!57, !59, !60}			!56 = !{!57, !59, !60}
	!57 = !DILocalVariable(name: "x", scope: !53, file: !1, line: 19, type: !58)			!57 = !DILocalVariable(name: "x", scope: !53, file: !1, line: 19, type: !58)
	!58 = !DIDerivedType(tag: DW_TAG_pointer_type, baseType: !14, size: 64)			!58 = !DIDerivedType(tag: DW_TAG_pointer_type, baseType: !14, size: 64)
	!77 = !DILocation(line: 27, column: 19, scope: !74)			!77 = !DILocation(line: 27, column: 19, scope: !74)
	!78 = !DILocation(line: 27, column: 10, scope: !75)			!78 = !DILocation(line: 27, column: 10, scope: !75)
	!79 = !DILocation(line: 28, column: 11, scope: !74)			!79 = !DILocation(line: 28, column: 11, scope: !74)
	!80 = !DILocation(line: 28, column: 9, scope: !74)			!80 = !DILocation(line: 28, column: 9, scope: !74)
	!81 = !DILocation(line: 30, column: 11, scope: !74)			!81 = !DILocation(line: 30, column: 11, scope: !74)
	!82 = !DILocation(line: 32, column: 17, scope: !75)			!82 = !DILocation(line: 32, column: 17, scope: !75)
	!83 = !DILocation(line: 32, column: 20, scope: !75)			!83 = !DILocation(line: 32, column: 20, scope: !75)
	!84 = !DILocation(line: 32, column: 13, scope: !85)			!84 = !DILocation(line: 32, column: 13, scope: !85)
				;; A discriminator of 116916311 which is 0x6f80057 in hexdecimal, stands for an indirect call probe
				;; with an index of 10 and probe factor of 0.95.
	!85 = !DILexicalBlockFile(scope: !75, file: !1, discriminator: 116916311)			!85 = !DILexicalBlockFile(scope: !75, file: !1, discriminator: 116916311)
	!86 = !DILocation(line: 32, column: 11, scope: !75)			!86 = !DILocation(line: 32, column: 11, scope: !75)
	!87 = !DILocation(line: 33, column: 5, scope: !75)			!87 = !DILocation(line: 33, column: 5, scope: !75)
	!88 = !DILocation(line: 26, column: 41, scope: !69)			!88 = !DILocation(line: 26, column: 41, scope: !69)
	!89 = distinct !{!89, !71, !90, !91}			!89 = distinct !{!89, !71, !90, !91}
	!90 = !DILocation(line: 33, column: 5, scope: !61)			!90 = !DILocation(line: 33, column: 5, scope: !61)
	!91 = !{!"llvm.loop.mustprogress"}			!91 = !{!"llvm.loop.mustprogress"}
	!92 = !DILocation(line: 34, column: 21, scope: !53)			!92 = !DILocation(line: 34, column: 21, scope: !53)
	!93 = !DILocation(line: 34, column: 5, scope: !94)			!93 = !DILocation(line: 34, column: 5, scope: !94)
	!94 = !DILexicalBlockFile(scope: !53, file: !1, discriminator: 104333335)			!94 = !DILexicalBlockFile(scope: !53, file: !1, discriminator: 104333335)
	!95 = !DILocation(line: 36, column: 1, scope: !53)			!95 = !DILocation(line: 36, column: 1, scope: !53)
	!96 = !DILocation(line: 35, column: 5, scope: !53)			!96 = !DILocation(line: 35, column: 5, scope: !53)

	; CHECK: %[[#]] = call i32 (i32, ...) %30(i32 %[[#]]) #[[#]], !dbg ![[#DBGID:]], !prof ![[#]]			; CHECK: define dso_local i32 @main
				; CHECK: %[[#]] = call i32 (i32, ...) %[[#]](i32 %[[#]]) #[[#]], !dbg ![[#DBGID:]], !prof ![[#PROF:]]

	;; A discriminator of 69206039 which is 0x4200017 in hexdecimal, stands for an indirect call probe			;; A discriminator of 69206039 which is 0x4200017 in hexdecimal, stands for an indirect call probe
	;; with an index of 2 and probe factor of 0.04.			;; with an index of 2 and probe factor of 0.04.
	; CHECK: ![[#DBGID]] = !DILocation(line: [[#]], column: [[#]], scope: ![[#SCOPE:]], inlinedAt: ![[#]])			; CHECK: ![[#DBGID]] = !DILocation(line: [[#]], column: [[#]], scope: ![[#SCOPE:]], inlinedAt: ![[#]])
	; CHECK: ![[#SCOPE]] = !DILexicalBlockFile(scope: ![[#]], file: ![[#]], discriminator: 69206039)			; CHECK: ![[#SCOPE]] = !DILexicalBlockFile(scope: ![[#]], file: ![[#]], discriminator: 69206039)
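
The discriminator comments in this test can be checked by decoding the values by hand. The sketch below assumes a bit layout inferred from the three values quoted in those comments: low three bits set to 0x7, a 16-bit probe index above them, then a 7-bit factor in percent, then a 3-bit type where 1 is taken to mean an indirect call. `DecodedProbe` and `decodeProbeDiscriminator` are hypothetical names, not LLVM APIs.

```cpp
#include <cstdint>

// Assumed layout (inferred from the test comments, not taken from the patch):
//   [2:0]   0x7 marker
//   [18:3]  probe index
//   [25:19] distribution factor, in percent
//   [28:26] probe type (1 is assumed to mean indirect call)
struct DecodedProbe {
  uint32_t Index;
  uint32_t FactorPercent;
  uint32_t Type;
};

constexpr DecodedProbe decodeProbeDiscriminator(uint32_t D) {
  return {(D >> 3) & 0xFFFF, (D >> 19) & 0x7F, (D >> 26) & 0x7};
}
```

Under that layout, 108527639 decodes to index 2 with factor 79, 116916311 to index 10 with factor 95, and 69206039 to index 2 with factor 4, which agrees with the comments in the test.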

				;; The remaining count of the second target (bar) should be from the original count multiplied by two callsite
				;; factors, i.e, roughly 11259 * 0.95 * 0.79 = 8444.
				; CHECK: ![[#PROF]] = !{!"VP", i32 0, i64 8444, i64 7546896869197086323, i64 -1, i64 -2012135647395072713, i64 8444}
				wenlei: What was the value before this fix? Is that much smaller than 0.95?
				hoy: It was 0.04 * 11259 = 450, where 0.04 was the distribution factor fixed up earlier at inlining time (line 255).
				wenlei: So with this fix, the distribution factor we see after the sample loader changed from 0.04 to 0.95, correct?
				hoy: The distribution factor is still 0.04, but the call target count for bar is 8444 instead of 450.
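
The numbers in this thread can be reproduced with a few lines of integer arithmetic. This is a sketch under one assumption that the thread does not state: factors are kept as whole percents and truncated when combined, so 0.95 and 0.79 combine to 75 rather than 75.05. The helper names `combinePercent` and `scaleByPercent` are hypothetical. Without the truncation, 11259 * 0.95 * 0.79 is about 8449.9, which is why the test comment says "roughly".

```cpp
#include <cstdint>

// Combine two percent factors, truncating to a whole percent (assumption).
constexpr uint64_t combinePercent(uint64_t A, uint64_t B) {
  return A * B / 100;
}

// Scale a raw sample count by a whole-percent factor, truncating.
constexpr uint64_t scaleByPercent(uint64_t Count, uint64_t Percent) {
  return Count * Percent / 100;
}

// Before the fix: the leftover indirect callsite factor of 0.04 scaled bar.
static_assert(scaleByPercent(11259, 4) == 450, "count before the fix");

// After the fix: only the two callsite factors (0.95 and 0.79) apply.
static_assert(scaleByPercent(11259, combinePercent(95, 79)) == 8444,
              "count after the fix");
```

The probe factor on the remaining indirect call stays at 0.04 in both cases. What changes is whether that factor is also applied to the call target count recorded in the `!prof` metadata.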


Revision Contents

Diff 339369

llvm/lib/Transforms/IPO/SampleProfile.cpp

llvm/test/Transforms/SampleProfile/pseudo-probe-icp-factor.ll
