This is an archive of the discontinued LLVM Phabricator instance.

[CSSPGO] Fix incorrect prorating indirect call distribution factor that leads to target count loss.
ClosedPublic

Authored by hoy on Apr 21 2021, 1:09 PM.

Details

Summary

Pseudo probe distribution factor is used to scale down profile samples to avoid misleading the counts inference due to the usage of "maximum" in getBlockWeight. For callsites, the scaling down can come from code duplication prior to the sample profile loader (prelink or postlink), or due to the indirect call promotion in sample loader inliner. This patch fixes an issue in sample loader ICP where the leftover indirect callsite scaling down causes the loss of non-promoted call target samples unexpectedly. While the scaling down is to favor BFI/BPI with accurate an callsite count, it doesn't fit in the current distribution factor that represents code duplication changes. Ideally, we would need two factors, one is for code duplication, the other is for ICP. However this seems over complicated. I'm going to trade one usage (callsite counts) for the other (call target counts).

Seeing perf win on one benchmark (mcf) of SPEC2017 with others unchanged.

Diff Detail

Event Timeline

hoy created this revision.Apr 21 2021, 1:09 PM
hoy requested review of this revision.Apr 21 2021, 1:09 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 21 2021, 1:09 PM
hoy updated this revision to Diff 339369.Apr 21 2021, 1:36 PM

Updating D100993: [CSSPGO] Fix incorrect prorating indirect call distribution factor that leads to target count loss.

hoy edited the summary of this revision. (Show Details)Apr 21 2021, 1:39 PM
hoy added reviewers: wenlei, wlei, wmi, davidxl.
wenlei added inline comments.Apr 21 2021, 2:38 PM
llvm/lib/Transforms/IPO/SampleProfile.cpp
835

This makes sure that the count for the indirect call instruction accurately reflects the counts for remaining indirect call. However I thought later PGO ICP pass relies on the value profiles we put onto metadata, so even if the count for that icall instruction is a bit off, that ICP should not be affected. Is that correct?

Is the problem here more about pre-link sample loader ICP screw up distribution factor so post-link sample loader ICP sees a smaller count for the icall instruction which blocks the post-link sample loader ICP? In this case, if we look at call site target counts from profile, with promoted targets excluded, we would arrive at the correct sum even without distribution factor? Relying on distribution could get us the same result, but using prorating to mimic subtraction is a bit weird..

I can see how accurate distribution factor helps count quality in general though - it would make counts for icall more consistent before running inference. On that, how do we make sure promoted calls get correct counts for inference input though?

845

nit:

SumOrigin -> Original sum of target counts for indirect call before promoting given candidate.
Sum -> Sum of remaining target counts for indirect call after promoting given candidate.

There's no prorated sum - the Sum is adjust using subtraction

Sum -= Candidate.CallsiteCount;

llvm/test/Transforms/SampleProfile/pseudo-probe-icp-factor.ll
261

What was the value before this fix? Is that much smaller than 0.95?

hoy added inline comments.Apr 21 2021, 3:09 PM
llvm/lib/Transforms/IPO/SampleProfile.cpp
835

Yes, this is supposed to make sure that the indirect call and its enclosing block will have a correct count during sample annotation, but looks like doing it here is late. I just realized this is after computeAndPropagateWeights. Ideally, the distribution factor should be updated before computeAndPropagateWeights and the call target samples should be adjusted with the pre-updated factor. That means updateIDTMetaData should be done before computeAndPropagateWeights.

The prelink problem doesn't need this to be fixed. We have the pseudo probe update pass that updates distribution factor based on BFI to handle that.

845

Sum is actually passed in as prorated. Candidate.CallsiteCount is also prorated. Maybe "Prorated sum of remaining target counts for indirect call after promoting given candidate"?

llvm/test/Transforms/SampleProfile/pseudo-probe-icp-factor.ll
261

It was 0.04 * 11259 = 450, where 0.04 was the distribution factor fixed up earlier in inlining time (line 255).

wenlei added inline comments.Apr 21 2021, 4:25 PM
llvm/lib/Transforms/IPO/SampleProfile.cpp
835

Ideally, the distribution factor should be updated before computeAndPropagateWeights and the call target samples should be adjusted with the pre-updated factor.

Agreed, that ensures good input for inference.

That means updateIDTMetaData should be done before computeAndPropagateWeights.

I think that update is meant for value profile metadata, which is orthogonal to computeAndPropagateWeights. In fact, we always set metadata after inference is done. Is it possible to decouple the distribution factor setting from metadata update?

The prelink problem doesn't need this to be fixed. We have the pseudo probe update pass that updates distribution factor based on BFI to handle that.

Ok, then does this cause immediate problem for ICP (does the fix generate good result for xalancbmk without setting icp threshold to 2)?

On that, how do we make sure promoted calls get correct counts for inference input though?

Taking another look, line 891 of sampleprofile.cpp took care of this?

845

Ah, I see. You're right, sorry for the confusion.

llvm/test/Transforms/SampleProfile/pseudo-probe-icp-factor.ll
261

So with this fix, the distribution factor we see after sample loader changed from 0.04 to 0.95, correct?

hoy added inline comments.Apr 21 2021, 6:12 PM
llvm/lib/Transforms/IPO/SampleProfile.cpp
835

Is it possible to decouple the distribution factor setting from metadata update?

They are coupled in this patch because the target counts from profile need to be prorated based on a pre-updated factor (say from prelink or inlining, but not ICP). The ICP-updated factor is the final one and is needed before computeAndPropagateWeights. Decoupling the ICP factor update from metadata update means we need to save the original factor somewhere, which can be possibly done using an intermediate table.

On the other hand, the indirect call metadata update may not rely on computeAndPropagateWeights, unlike the branch metdata update. But separating them seems a bigger change and less understandable.

Ok, then does this cause immediate problem for ICP (does the fix generate good result for xalancbmk without setting icp threshold to 2)?

It causes less promotion with later PGO ICP since the early prorate eats the target count for non-promoted targets. I'm seeing this helps some benchmarks like perlbench and gcc, but regresses other benchmarks like h264. I guess that's due to the heuristics of PGO ICP. Overall no change in geomean.

Taking another look, line 891 of sampleprofile.cpp took care of this?

Yes.

llvm/test/Transforms/SampleProfile/pseudo-probe-icp-factor.ll
261

The distribution factor is still 0.04, but the call target count for bar is 8444 instead of 450.

wenlei added inline comments.Apr 22 2021, 12:42 AM
llvm/lib/Transforms/IPO/SampleProfile.cpp
835

Decoupling the ICP factor update from metadata update means we need to save the original factor somewhere, which can be possibly done using an intermediate table.

My understanding: final_factor = original_factor * (remaining_counts / total_counts), and the original factor is simply Probe->Factor. Why do we need to save the original factor? Did I miss anything?

On the other hand, the indirect call metadata update may not rely on computeAndPropagateWeights, unlike the branch metdata update. But separating them seems a bigger change and less understandable.

If icall metadata update doesn't reply on inference, it's ok to set icall metadata before inference. But that "breaks" the high level structure of inference first, followed by metedata update to persist the result onto metadata, hence a bit inconsistent.

It causes less promotion with later PGO ICP since the early prorate eats the target count for non-promoted targets.

Hmm.. this goes back to my original question in earlier comments. "However I thought later PGO ICP pass relies on the value profiles we put onto metadata, so even if the count for that icall instruction is a bit off, that ICP should not be affected. Is that correct?"

I now see this T = SampleRecord::adjustCallTargets(T.get(), Probe->Factor); which explains why a smaller factor can screw up target counts and value profile metadata, in addition to block count.

This part still looks fishy. It's reasonable to scale down all target counts when call site is duplicated. But if the factor is also used to get to correct remaining count for icall itself, scaling down all target count using such factor is not right. I understand you now set factor later to avoid scaling down targets in the same pass, but previous pass can still set a factor even when there's no call site duplication.

As I understand, in ICP case, the original motivation for distribution factor is simply to account for call site duplication. If we stick to that, scaling down all targets is fine. Problem arise when we use distribution factor to mimic the effect of subtraction (of promoted target counts). Now the fix workaround the issue by setting distribution factor late, but I still feel using distribution factor to mimic subtraction is a bit hacky, and gives distribution factor inconsistent meaning (which is not compatible when used to scale down all targets). I hope we can have better solution, let's discuss tomorrow.

hoy added inline comments.Apr 22 2021, 9:31 AM
llvm/lib/Transforms/IPO/SampleProfile.cpp
835

My understanding: final_factor = original_factor * (remaining_counts / total_counts), and the original factor is simply Probe->Factor. Why do we need to save the original factor? Did I miss anything?

The original factor, instead of the final factor, is used to scale down target counts, i.e, SampleRecord::adjustCallTargets. This is to make sure later PGO ICP see the right target count due to code duplication. The final factor is used to scale down the indirect call site counts which is used in computeAndPropagateWeights. Since adjustCallTargets currently runs after computeAndPropagateWeights, original_factor needs to be saved somewhere.

Hmm.. this goes back to my original question in earlier comments. "However I thought later PGO ICP pass relies on the value profiles we put onto metadata, so even if the count for that icall instruction is a bit off, that ICP should not be affected. Is that correct?"

Right, callsite counts only affect BFI, but not PGO ICP. SampleRecord::adjustCallTargets affects ICP, and they use different factors, i.e, final factor and original factor, respectively.

This part still looks fishy. It's reasonable to scale down all target counts when call site is duplicated. But if the factor is also used to get to correct remaining count for icall itself, scaling down all target count using such factor is not right. I understand you now set factor later to avoid scaling down targets in the same pass, but previous pass can still set a factor even when there's no call site duplication.

It was wrong because we used final factor to scale down all target count. Now we are using the original factor, which could be from code duplication or prelink sample ICP. When it is from prelink sample ICP and we are at postlink, the promoted targets being scaled down here (which are wrong) will be ignored since they are known promoted from the magic number NOMORE_ICP_MAGICNUM in the callsite metadata.

hoy updated this revision to Diff 339683.Apr 22 2021, 9:32 AM

Saving the original factor and use it to scale down target counts.

hoy added inline comments.Apr 22 2021, 9:38 AM
llvm/lib/Transforms/IPO/SampleProfile.cpp
835

Hmm, there seems still a problem with this approach. If the original factor comes from prelink sample ICP, and using it to scale down target counts in postlink doesn't make sense. Looks like we need two factors, one is for code duplication, the other is for ICP.

wenlei added inline comments.Apr 22 2021, 10:13 AM
llvm/lib/Transforms/IPO/SampleProfile.cpp
835

Hmm, there seems still a problem with this approach. If the original factor comes from prelink sample ICP, and using it to scale down target counts in postlink doesn't make sense. Looks like we need two factors, one is for code duplication, the other is for ICP.

Yeah, this is exactly what I meant in my last reply. Fundamentally the problem is trying to use the factor which is supposed to handle duplication to also adjust remaining total counts for icall after ICP. The two usage are incompatible, moving the order only works around the issue for things happens in the same pass. Two factor would work, but adds complexity. How does AutoFDO handles this?

hoy added inline comments.Apr 23 2021, 9:02 AM
llvm/lib/Transforms/IPO/SampleProfile.cpp
835

Discussed offline. Two factors seems over complicated. Going for a trade one usage (callsite counts) for the other (call target counts). Seeing perf win on one benchmark (mcf) of SPEC2017 with others unchanged.

hoy updated this revision to Diff 340062.Apr 23 2021, 9:03 AM

Updating D100993: [CSSPGO] Fix incorrect prorating indirect call distribution factor that leads to target count loss.

wenlei added inline comments.Apr 23 2021, 9:12 AM
llvm/lib/Transforms/IPO/SampleProfile.cpp
468

This map is no longer needed?

hoy added inline comments.Apr 23 2021, 9:24 AM
llvm/lib/Transforms/IPO/SampleProfile.cpp
468

Oops, forgot to remove it.

hoy updated this revision to Diff 340078.Apr 23 2021, 9:25 AM

Updating D100993: [CSSPGO] Fix incorrect prorating indirect call distribution factor that leads to target count loss.

wenlei accepted this revision.Apr 23 2021, 9:27 AM

lgtm, thanks for addressing the ICP issues.

This revision is now accepted and ready to land.Apr 23 2021, 9:27 AM
hoy edited the summary of this revision. (Show Details)Apr 23 2021, 9:34 AM
This revision was landed with ongoing or failed builds.Apr 23 2021, 11:09 AM
This revision was automatically updated to reflect the committed changes.