This is an archive of the discontinued LLVM Phabricator instance.

[llvm-profgen] Skip duplication factor outside of body sample computation
ClosedPublic

Authored by wenlei on Oct 18 2021, 5:46 PM.

Details

Summary

We incorrectly apply the duplication factor to total samples even though we already accumulate samples instead of taking the MAX. This causes the profile to have bloated total samples for functions with unrolled or vectorized loops. This change fixes the issue for total samples, head samples, and call target samples.
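For context, a minimal sketch of the intended accounting (hypothetical names and Python pseudocode; the actual llvm-profgen logic is C++):

```python
# Sketch of the fix, with hypothetical names. The duplication factor
# compensates for compiler transformations (loop unrolling,
# vectorization) that replicate an instruction, so it belongs in the
# per-line *body* sample computation. Total samples are accumulated
# across all instructions already, so scaling them by the factor
# double-counts.

def accumulate(records):
    """records: list of (sample_count, duplication_factor) pairs."""
    body_samples = 0
    total_samples = 0
    for count, dup_factor in records:
        body_samples += count * dup_factor  # factor applies here
        total_samples += count              # ...but not here (the fix)
    return body_samples, total_samples

# Two samples attributed to a line inside a 4x-unrolled loop:
body, total = accumulate([(10, 4), (5, 4)])
# body == 60 (scaled per-line count), total == 15 (plain accumulation);
# before the fix, total would also have been a bloated 60.
```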

Diff Detail

Event Timeline

wenlei created this revision.Oct 18 2021, 5:46 PM
wenlei requested review of this revision.Oct 18 2021, 5:46 PM
Herald added a project: Restricted Project.Oct 18 2021, 5:46 PM

@rajeshwarv, checking google's tool, it looks like the same issue exists there too - the total sample count could be bloated due to the duplication factor. Wondering if the code linked below reflects the latest version used internally, or if there's any special reason for such a computation? cc @davidxl

https://github.com/google/autofdo/blob/master/symbol_map.cc#L508

https://github.com/google/autofdo/blob/master/symbol_map.cc#L541-L544

davidxl added a subscriber: xur.Oct 18 2021, 9:31 PM

Probably not an issue with FSAFDO discriminators?

hoy accepted this revision.Oct 19 2021, 9:08 AM

Thanks for the fix, lgtm.

This revision is now accepted and ready to land.Oct 19 2021, 9:08 AM

Probably not an issue with FSAFDO discriminators?

I think so. It only affects AutoFDO using the duplication factor, which I assume is still in use at Google internally?

wenlei added subscribers: hoy, wlei.Oct 19 2021, 10:00 AM

Good to know. Just to clarify, you meant that the removal of duplication factor doesn’t show perf impact, right? You didn’t actually try to see if fixing the bloated total samples (as proposed in the patch) has perf impact?

Asking because I think the fix here would affect sample loader inlining for AutoFDO, since it looks at total samples for inline decisions. It may still not have perf impact, depending on workloads. For our HHVM workload, one cold function was incorrectly counted among the top 5 hot functions in the profile due to this bug. But I’m surprised that this bug survived in the tools used by both of us through the years.. 😊

Thanks,
Wenlei

From: Rong Xu <xur@google.com>
Date: Tuesday, October 19, 2021 at 9:44 AM
To: Wenlei He <reviews+D112042+public+c17afeb67dc591e4@reviews.llvm.org>
Cc: Wenlei He <wenlei@fb.com>, Hongtao Yu <hoy@fb.com>, Lei Wang <wlei@fb.com>, davidxl@google.com <davidxl@google.com>, rajeshwarv@google.com <rajeshwarv@google.com>, llvm-commits@lists.llvm.org <llvm-commits@lists.llvm.org>, lxfind@gmail.com <lxfind@gmail.com>, Modi Mo <modimo@fb.com>, bhuvanendra.kumarn@amd.com <bhuvanendra.kumarn@amd.com>, yanliang.mu@intel.com <yanliang.mu@intel.com>, dougpuob@gmail.com <dougpuob@gmail.com>, michael.hliao@gmail.com <michael.hliao@gmail.com>, florian_hahn@apple.com <florian_hahn@apple.com>, david.green@arm.com <david.green@arm.com>, simon.moll@emea.nec.com <simon.moll@emea.nec.com>
Subject: Re: [PATCH] D112042: [llvm-profgen] Skip duplication factor outside of body sample computation

Your create_llvm_prof tool also has the same bug for AutoFDO as I commented earlier.

https://github.com/google/autofdo/blob/master/symbol_map.cc#L508

https://github.com/google/autofdo/blob/master/symbol_map.cc#L541-L544

Thanks,
Wenlei

From: Rong Xu <xur@google.com>
Date: Tuesday, October 19, 2021 at 11:23 AM
To: Wenlei He <wenlei@fb.com>
Cc: Wenlei He <reviews+D112042+public+c17afeb67dc591e4@reviews.llvm.org>, Hongtao Yu <hoy@fb.com>, Lei Wang <wlei@fb.com>, davidxl@google.com <davidxl@google.com>, rajeshwarv@google.com <rajeshwarv@google.com>, lxfind@gmail.com <lxfind@gmail.com>, Modi Mo <modimo@fb.com>
Subject: Re: [PATCH] D112042: [llvm-profgen] Skip duplication factor outside of body sample computation

xur added a subscriber: wenlei.Oct 19 2021, 2:03 PM

Sorry. I missed that part.

The total count straight from create_llvm_prof is always garbage.
The reason for this is that create_llvm_prof does not disassemble the instruction. It just uses address range to accumulate the counters.
I found this when I worked on FSAFDO. If I want to get good function total counts, I always do another merge of the profile (profile_merger) -- this way the entry count will be recomputed and it's the sum of the counter inside.
I had some discussions with Wei about this. But we think the function's total count is not really used in the compiler.
I did some experiments and did not find a performance difference with fixed function total counts.

Our production build uses another pipeline (it has a disassembler) and does not suffer this issue.

Overcounting is unavoidable here since there is no basic block information (all the instructions with different offsets will have a count).

-Rong

The total count straight from create_llvm_prof is always garbage.

I didn’t realize that until now..

The reason for this is that create_llvm_prof does not disassemble the instruction. It just uses address range to accumulate the counters.

Yes, that’s one source of overcounting in create_llvm_prof, but the issue here is different: it’s simply a bug in how the duplication factor is used. Also note that llvm-profgen does disassemble instructions and calculates the total count properly.
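To illustrate the difference (hypothetical numbers, not the actual create_llvm_prof code): without disassembly, a tool can only attribute an address range's count to every sampled offset inside it, so the per-function total grows with the number of sampled offsets rather than with actual execution.

```python
# Sketch with made-up numbers: a straight-line region executed 100
# times, sampled at four byte offsets. Range-based accumulation adds
# the count once per offset, inflating the total; a disassembling tool
# like llvm-profgen can attribute counts per instruction instead.

range_count = 100
sampled_offsets = [0x10, 0x12, 0x16, 0x18]

range_based_total = sum(range_count for _ in sampled_offsets)
# range_based_total == 400, for a region that only ran 100 times.
```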

I had some discussions with Wei about this. But we think the function's total count is not really used in the compiler.

Well.. the sample loader does look at the total count for inline decisions. See the code links below.

https://github.com/llvm/llvm-project/blob/main/llvm/lib/Transforms/IPO/SampleProfile.cpp#L1082
->
https://github.com/llvm/llvm-project/blob/main/llvm/lib/Transforms/Utils/SampleProfileLoaderBaseUtil.cpp#L62
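A rough sketch of why bloated totals matter here (hypothetical threshold logic, only loosely modeled on the sample loader's use of total samples; not the actual LLVM code):

```python
# Hypothetical hotness check: a callee whose total samples exceed some
# fraction of the program's total samples is treated as hot and
# considered for inlining. A bug that multiplies totals by the
# duplication factor can push a genuinely cold function over the
# threshold.

def is_hot(callee_total_samples, program_total_samples,
           hot_fraction=0.0005):
    return callee_total_samples >= program_total_samples * hot_fraction

program_total = 1_000_000
real_samples = 100          # genuinely cold function
bloated = real_samples * 8  # e.g. inflated by an 8x duplication factor

# With real counts the function is correctly cold; with the bloated
# total it crosses the (hypothetical) hotness threshold.
```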

I found this when I worked on FSAFDO. If I want to get good function total counts, I always do another merge of the profile (profile_merger) -- this way the entry count will be recomputed and it's the sum of the counter inside.

Is profile_merger a separate tool you use internally? I wasn’t aware of it..

Our production build uses another pipeline (it has a disassembler) and does not suffer this issue.

Okay, good to know. I thought create_llvm_prof was used in your production pipeline..

Thanks,
Wenlei


I'm late to this discussion (sorry about that), but I can confirm that this shouldn't impact our production. We don't use this tool/code in our workflows, but thank you for pointing it out.

The reason for this is that create_llvm_prof does not disassemble the instruction. It just uses address range to accumulate the counters.

Should we fix it in create_llvm_prof as a post-pass in the tool? I see this creates a discrepancy between our production pipeline and this power-user case (create_llvm_prof), albeit one that is benign as of today.


Actually, this shows the benefit of having a unified tool: one fix in one place can take care of all users. llvm-profgen is now almost ready for AutoFDO too (it's fully ready for CSSPGO, and FS-AutoFDO will be supported soon), so how about we all focus our effort on llvm-profgen? Would it be reasonable for you to also use llvm-profgen for internal, non-prod, power-user cases?

We plan to make llvm-profgen part of our production pipeline soon, replacing an internal tool, and also to use it for power-user cases for AutoFDO, CSSPGO, and FS-AutoFDO in the future.