This is an archive of the discontinued LLVM Phabricator instance.

[AsmPrinter][PGO] Adds optional dumping of branch probabilities for PGO metrics.
Needs ReviewPublic

Authored by red1bluelost on Aug 25 2023, 2:10 PM.

Details

Summary

[1/2] This is the first of two patches for branch accuracy metrics. Second at D158890.

This patch adds optional dumping of branch probabilities that can be used for collecting branch
accuracy metrics. Our metrics compare branch probabilities against execution traces. This patch is
for dumping those branch probabilities.

Testing:
New llvm test to verify branch probabilities are dumped as expected on x86_64.
Ran check-llvm with x86 enabled.
Probability dumping also testing using the scripts in the second patch of the series.

Diff Detail

Event Timeline

red1bluelost created this revision.Aug 25 2023, 2:10 PM
Herald added a project: Restricted Project. · View Herald TranscriptAug 25 2023, 2:10 PM
red1bluelost requested review of this revision.Aug 25 2023, 2:10 PM
Herald added a project: Restricted Project. · View Herald TranscriptAug 25 2023, 2:10 PM
red1bluelost edited the summary of this revision. (Show Details)Aug 25 2023, 2:11 PM
wenlei added a subscriber: hoy.Aug 25 2023, 2:20 PM

We have something similar internally, but didn't upstream because we are not sure if the use case is too narrow to justify burdening the code base with all the related complexity. cc @hoy

Rebased patch to fix buildkite.

mtrofin accepted this revision.Aug 25 2023, 3:14 PM

This is great! Can you share what tracing methodology you use?

LGTM with some nits.

llvm/include/llvm/CodeGen/AsmPrinter.h
390

nit: init at declaration (both fields)

llvm/lib/CodeGen/AsmPrinter/AsmPrinter.cpp
1657

why not just use EnableBranchProbabilityDumping?

llvm/test/Transforms/PGOProfile/asm_emit_branch_prob.ll
65

nit: line up the CHECK

This revision is now accepted and ready to land.Aug 25 2023, 3:14 PM
mtrofin added inline comments.Aug 25 2023, 3:15 PM
llvm/lib/CodeGen/AsmPrinter/AsmPrinter.cpp
1760

what about virtual calls?

BTW, have you considered using -mbb-profile-dump (defined in AsmPrinter.cpp) - same intention, just MBB frequencies, and dumping to a csv; also assumes compilation happened with -basic-block-sections=labels

This is great! Can you share what tracing methodology you use?

We used Pin Tool to collect branch traces and a basic python script to process the data and generate CSVs
and stats. Those are in the second patch D158890. We've been able to use these metrics on x86 and
internally with a slightly modified script but it should be able to support other targets that can collect
branch traces.

BTW, have you considered using -mbb-profile-dump (defined in AsmPrinter.cpp) - same intention, just MBB frequencies, and dumping to a csv; also assumes compilation happened with -basic-block-sections=labels

We had not considered that. We mainly focused in on branch weights since it is a primary result of
SamplePGO. Definitely will look into if similar metrics can be done :)

hoy added a comment.EditedAug 26 2023, 12:24 PM

We have something similar internally, but didn't upstream because we are not sure if the use case is too narrow to justify burdening the code base with all the related complexity. cc @hoy

Yeah, we internally serialize machine block execution counts to the binary. Would it be helpful to write those counts into some bb section similar to the llvm_bb_addr_map section?

Addresses nits.

red1bluelost marked 2 inline comments as done.Aug 28 2023, 7:51 AM
red1bluelost added inline comments.
llvm/lib/CodeGen/AsmPrinter/AsmPrinter.cpp
1657

Since we are reading from a global variable and the following loop it probably hot code, I thought it would be better to cache it outside the loop rather than keep referencing the global.

1760

I don't think MachineInstr has a notion of virtual calls. Unless you mean something else or I'm missing something.

We have something similar internally, but didn't upstream because we are not sure if the use case is too narrow to justify burdening the code base with all the related complexity. cc @hoy

Yeah, we internally serialize machine block execution counts to the binary. Would it be helpful to write those counts into some bb section similar to the llvm_bb_addr_map section?

This is interesting and similar but our metrics are only focusing on branch weights at the moment. We've focused on branch weights so far since they are a direct result of Profile Loading.

At least for the current state of the metrics here and consumed in D158890, there is not a use for block counts but it could be a future extension of the metrics.

mtrofin added inline comments.Aug 28 2023, 8:58 AM
llvm/lib/CodeGen/AsmPrinter/AsmPrinter.cpp
1760

Correct, but (I mean this as brainstorming, not blocking this patch - which still lgtm) you could for instance record "there's an indirect call here, with <profile>" and correlate with your trace. I.e. "are we missing out on indirect call promotion opportunities".

hoy added a comment.Aug 28 2023, 10:54 AM

We have something similar internally, but didn't upstream because we are not sure if the use case is too narrow to justify burdening the code base with all the related complexity. cc @hoy

Yeah, we internally serialize machine block execution counts to the binary. Would it be helpful to write those counts into some bb section similar to the llvm_bb_addr_map section?

This is interesting and similar but our metrics are only focusing on branch weights at the moment. We've focused on branch weights so far since they are a direct result of Profile Loading.

At least for the current state of the metrics here and consumed in D158890, there is not a use for block counts but it could be a future extension of the metrics.

Thanks for the information. Maybe we can look into encoding block counts metrics somewhere.

A bit more context about our use case: we use block counts to check if the compiler does a good job maintaining the input profile quality throughout its pipeline. Using branch weights would require an offline BFI computation to recover the block counts based on the branch weights, which may not give as accurate information as the profiler (such as the LBR profiler) gives.

red1bluelost added inline comments.Aug 29 2023, 1:46 PM
llvm/lib/CodeGen/AsmPrinter/AsmPrinter.cpp
1760

That definitely sounds like a good idea. Our encoded reserved space just for this type of extra info. Agreed it'd fit in a later patch :)

@mtrofin Would you be able to commit on my behalf? (Name: Micah Weston, Email: micahsweston@gmail.com)

MaskRay added inline comments.
llvm/lib/MC/MCObjectFileInfo.cpp
1225

This logic hasn't been thoroughly tested. We need comdat and !associated. Please see the log of the code you copied from.

MaskRay requested changes to this revision.Aug 30 2023, 2:39 PM
MaskRay added inline comments.
llvm/test/Transforms/PGOProfile/asm_emit_branch_prob.ll
2

We likely need an explicit -mtriple=x86_64 for llc commands and use REQUIRES: x86-registered-target.

69

Please don't test the string before warning:. It's not reliable in some environments when the tool is a symlink to a real executable.

This revision now requires changes to proceed.Aug 30 2023, 2:39 PM

Addresses two of the suggestions.

red1bluelost marked 2 inline comments as done.Aug 30 2023, 3:39 PM
red1bluelost added inline comments.
llvm/lib/MC/MCObjectFileInfo.cpp
1225

Could you elaborate on this? I'm very confused as to what you are referring to or how to go about fixing it.

Fixes formatting issue with git clang-format.

We have something similar internally, but didn't upstream because we are not sure if the use case is too narrow to justify burdening the code base with all the related complexity. cc @hoy

Yeah, we internally serialize machine block execution counts to the binary. Would it be helpful to write those counts into some bb section similar to the llvm_bb_addr_map section?

This is interesting and similar but our metrics are only focusing on branch weights at the moment. We've focused on branch weights so far since they are a direct result of Profile Loading.

At least for the current state of the metrics here and consumed in D158890, there is not a use for block counts but it could be a future extension of the metrics.

We're probably among the few people that will actually benefit from something like this, but honestly I'm still a bit unsure whether the use case is common enough to justify built-in support like this.

However, if we make something like this part of llvm, I suggest we at least make it as general purpose as possible. A few comments related to that:

  1. Try to incorporate block counts/frequencies as well. Most of the researches on profile quality use a block overlap metric which relies on block counts rather than branch probabilities. Our internal version also uses block counts, as branch weights cannot represent the profile for branchless code.
  1. Instead of coupling this with a specific consumer, Pin tool in your case, and in the next patch, I suggest we build general support to decode such metadata section, so tools like llvm-objdump can be used to inspect its payload.
mtrofin added a comment.EditedAug 31 2023, 10:23 PM

We have something similar internally, but didn't upstream because we are not sure if the use case is too narrow to justify burdening the code base with all the related complexity. cc @hoy

Yeah, we internally serialize machine block execution counts to the binary. Would it be helpful to write those counts into some bb section similar to the llvm_bb_addr_map section?

This is interesting and similar but our metrics are only focusing on branch weights at the moment. We've focused on branch weights so far since they are a direct result of Profile Loading.

At least for the current state of the metrics here and consumed in D158890, there is not a use for block counts but it could be a future extension of the metrics.

We're probably among the few people that will actually benefit from something like this, but honestly I'm still a bit unsure whether the use case is common enough to justify built-in support like this.

Even if relatively few groups would build up on such information, arguably PGO (as a technique) is probably the most impactful tool we have for improving performance. Like I think we're all observing here, profiles aren't necessarily well maintained throughout passes - it's actually easy to write a pass that accidentally and silently drops it, or that corrupts it somehow. I think having primitives in llvm that can help anyone interested build validation tooling and detect and fix such bugs would end up helping the community way more than their cost.

Maybe we can even, eventually, have a layer of defense doing this on a build bot with e.g. llvm-test-suite benchmarks. (I have a rfc for doing some even simpler validation transparently as part of opt/llc, would send it after the long weekend)

However, if we make something like this part of llvm, I suggest we at least make it as general purpose as possible. A few comments related to that:

  1. Try to incorporate block counts/frequencies as well. Most of the researches on profile quality use a block overlap metric which relies on block counts rather than branch probabilities. Our internal version also uses block counts, as branch weights cannot represent the profile for branchless code.

+1

  1. Instead of coupling this with a specific consumer, Pin tool in your case, and in the next patch, I suggest we build general support to decode such metadata section, so tools like llvm-objdump can be used to inspect its payload.

+1!

Actually - @red1bluelost - sorry if this would creep up the scope of your work, but since we're talking design choices, would summarizing in a RFC be maybe a good step? Then we can discuss the motivation and design choices in the community, and since there is similar work (@wenlei's earlier remark), their experience will surely help!

We're probably among the few people that will actually benefit from something like this, but honestly I'm still a bit unsure whether the use case is common enough to justify built-in support like this.

Micrea makes a good point about it being easy for a pass to corrupt PGO information the same way it can corrupt debug information. Having a tool to track or debug PGO info like branch_weights, even if something different than this patch, seems beneficial for upstream.

  1. Try to incorporate block counts/frequencies as well. Most of the researches on profile quality use a block overlap metric which relies on block counts rather than branch probabilities. Our internal version also uses block counts, as branch weights cannot represent the profile for branchless code.

I'll try to look into this.

  1. Instead of coupling this with a specific consumer, Pin tool in your case, and in the next patch, I suggest we build general support to decode such metadata section, so tools like llvm-objdump can be used to inspect its payload.

For the metrics, we need an execution branch trace for comparison, Pin tool we've found works for x86 and we have an tool for one of our internal targets. We hope to keep the tracing part minimally coupled to the compiler metadata.

Makes sense to add general support for extracting the section info. I can try to look into it.

Even if relatively few groups would build up on such information, arguably PGO (as a technique) is probably the most impactful tool we have for improving performance. Like I think we're all observing here, profiles aren't necessarily well maintained throughout passes - it's actually easy to write a pass that accidentally and silently drops it, or that corrupts it somehow. I think having primitives in llvm that can help anyone interested build validation tooling and detect and fix such bugs would end up helping the community way more than their cost.

Agreed that it is easy to corrupt. The team I'm with at MediaTek has found cases in LLVM passes and our backend where branch weights are mishandled or not updated which was a driving factor to developing metrics for it.

Maybe we can even, eventually, have a layer of defense doing this on a build bot with e.g. llvm-test-suite benchmarks. (I have a rfc for doing some even simpler validation transparently as part of opt/llc, would send it after the long weekend)

Internally we are planning to track some of our benchmarks using these metrics since we've found its helpful for this source of change/regression tracking for our PGO data.

Actually - @red1bluelost - sorry if this would creep up the scope of your work, but since we're talking design choices, would summarizing in a RFC be maybe a good step? Then we can discuss the motivation and design choices in the community, and since there is similar work (@wenlei's earlier remark), their experience will surely help!

That makes sense. I can see about getting stuff written up. It will take some time. I'll also be giving a talk at the developer meeting in October on how we've been using the metrics and where we think it can help others.

[...]

Actually - @red1bluelost - sorry if this would creep up the scope of your work, but since we're talking design choices, would summarizing in a RFC be maybe a good step? Then we can discuss the motivation and design choices in the community, and since there is similar work (@wenlei's earlier remark), their experience will surely help!

That makes sense. I can see about getting stuff written up. It will take some time. I'll also be giving a talk at the developer meeting in October on how we've been using the metrics and where we think it can help others.

BTW, the "Practical Compiler Optimizations for Warehouse-Scale Applications" workshop at the conference would also be good to attend, the topic is very much in scope!

wenlei added a comment.Sep 1 2023, 1:19 PM

We're probably among the few people that will actually benefit from something like this, but honestly I'm still a bit unsure whether the use case is common enough to justify built-in support like this.

Micrea makes a good point about it being easy for a pass to corrupt PGO information the same way it can corrupt debug information. Having a tool to track or debug PGO info like branch_weights, even if something different than this patch, seems beneficial for upstream.

  1. Try to incorporate block counts/frequencies as well. Most of the researches on profile quality use a block overlap metric which relies on block counts rather than branch probabilities. Our internal version also uses block counts, as branch weights cannot represent the profile for branchless code.

I'll try to look into this.

  1. Instead of coupling this with a specific consumer, Pin tool in your case, and in the next patch, I suggest we build general support to decode such metadata section, so tools like llvm-objdump can be used to inspect its payload.

For the metrics, we need an execution branch trace for comparison, Pin tool we've found works for x86 and we have an tool for one of our internal targets. We hope to keep the tracing part minimally coupled to the compiler metadata.

Makes sense to add general support for extracting the section info. I can try to look into it.

Even if relatively few groups would build up on such information, arguably PGO (as a technique) is probably the most impactful tool we have for improving performance. Like I think we're all observing here, profiles aren't necessarily well maintained throughout passes - it's actually easy to write a pass that accidentally and silently drops it, or that corrupts it somehow. I think having primitives in llvm that can help anyone interested build validation tooling and detect and fix such bugs would end up helping the community way more than their cost.

Agreed that it is easy to corrupt. The team I'm with at MediaTek has found cases in LLVM passes and our backend where branch weights are mishandled or not updated which was a driving factor to developing metrics for it.

Maybe we can even, eventually, have a layer of defense doing this on a build bot with e.g. llvm-test-suite benchmarks. (I have a rfc for doing some even simpler validation transparently as part of opt/llc, would send it after the long weekend)

Internally we are planning to track some of our benchmarks using these metrics since we've found its helpful for this source of change/regression tracking for our PGO data.

Actually - @red1bluelost - sorry if this would creep up the scope of your work, but since we're talking design choices, would summarizing in a RFC be maybe a good step? Then we can discuss the motivation and design choices in the community, and since there is similar work (@wenlei's earlier remark), their experience will surely help!

That makes sense. I can see about getting stuff written up. It will take some time. I'll also be giving a talk at the developer meeting in October on how we've been using the metrics and where we think it can help others.

Thanks. If we make this general enough, I agree that it'd be useful. Looking forward to RFC and related discussions.

LLVM generally doesn't do a good in profile maintenance (updating branch_weights throughout optimization pipeline), and a good step towards improving that would be having a way to quantify profile quality, which is something like this would help.

Matt added a subscriber: Matt.Sep 5 2023, 3:00 PM
MatzeB added a subscriber: MatzeB.Sep 11 2023, 9:58 AM
red1bluelost abandoned this revision.Sep 15 2023, 10:28 AM

Closing this review and the second (D158890) as I work on the RFC. By the time a second review iteration is up, it will be via GitHub PR. Hoping to get the RFC out in the next 2-3 weeks.