This is an archive of the discontinued LLVM Phabricator instance.

[CSSPGO] Process functions in a top-down order on a dynamic call graph.
ClosedPublic

Authored by hoy on Feb 3 2021, 5:00 PM.

Details

Summary

Functions are currently processed by the sample profiler loader in a top-down order defined by the static call graph. The order is being adjusted to be a top-down order based on the input context-sensitive profile. One benefit is that the processing order of caller and callee in one SCC would follow the context order in the profile to favor more inlining. Another benefit is that the processing order of caller and callee through an indirect call (which is not on the static call graph) can be honored which in turn allows for more inlining.

Two switches -mllvm -use-profile-indirect-call-edges and -mllvm -use-profile-top-down-order are being introduced. Both are on by default.

Diff Detail

Event Timeline

hoy created this revision.Feb 3 2021, 5:00 PM
hoy requested review of this revision.Feb 3 2021, 5:00 PM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 3 2021, 5:00 PM
wmi added a comment.Feb 5 2021, 10:52 PM

What is the main difference between dynamic call graph based inlining vs static call graph + priority based inlining (https://reviews.llvm.org/D94001)?

hoy added a comment.Feb 6 2021, 9:36 AM
In D95988#2546697, @wmi wrote:

What is the main difference between dynamic call graph based inlining vs static call graph + priority based inlining (https://reviews.llvm.org/D94001)?

The work in this change is an enhancement to the priority-based inliner in that:

  1. Honor profile SCC traversal order for more inlining. E.g, where there is a circle in the static call graph say A->B->C->A, the static SCC traversal order could be any order but deterministic, let's. say B->C->A. If at runtime we see a context A->B->C->A->B->C, llvm-profgen may compress it into A->B->C. Therefore by walking the SCC in the B->C->A top-down order in the sample profile loader, we will not get B inlined into A. This change adjusts the SCC processing order to reflect what is in profile so that A->B->C will be walked in order.
  1. Honor indirect call edge order. Similar given an indirect call A->B at runtime which is missing on the static call graph, B may end up being processed before A. We'd like A to be processed before B so that B gets a chance to be inlined into A after indirect call promotion.
wmi added a comment.Feb 8 2021, 1:46 PM
In D95988#2546946, @hoy wrote:
In D95988#2546697, @wmi wrote:

What is the main difference between dynamic call graph based inlining vs static call graph + priority based inlining (https://reviews.llvm.org/D94001)?

The work in this change is an enhancement to the priority-based inliner in that:

  1. Honor profile SCC traversal order for more inlining. E.g, where there is a circle in the static call graph say A->B->C->A, the static SCC traversal order could be any order but deterministic, let's. say B->C->A. If at runtime we see a context A->B->C->A->B->C, llvm-profgen may compress it into A->B->C. Therefore by walking the SCC in the B->C->A top-down order in the sample profile loader, we will not get B inlined into A. This change adjusts the SCC processing order to reflect what is in profile so that A->B->C will be walked in order.
  1. Honor indirect call edge order. Similar given an indirect call A->B at runtime which is missing on the static call graph, B may end up being processed before A. We'd like A to be processed before B so that B gets a chance to be inlined into A after indirect call promotion.

Thanks for the explanation. I have two questions.

  1. Since static call graph edges have been removed, many cold callsites of small functions won't be inlined anymore. Those will all be left to CGSCC inlining. I understand that inlining those small functions may not help reduce caller size and boost more inlining because there is no cleanup after inlining in place in sample loader pass. But that is a big difference from current early inlining model. Do you see any impact from it?
  1. The two benefits above of dynamic call graph also applies to non-CSSPGO profile. Is it possible to make it optional for non-CSSPGO profile so we can try it?
hoy added a comment.Feb 8 2021, 2:47 PM
In D95988#2549605, @wmi wrote:
In D95988#2546946, @hoy wrote:
In D95988#2546697, @wmi wrote:

What is the main difference between dynamic call graph based inlining vs static call graph + priority based inlining (https://reviews.llvm.org/D94001)?

The work in this change is an enhancement to the priority-based inliner in that:

  1. Honor profile SCC traversal order for more inlining. E.g, where there is a circle in the static call graph say A->B->C->A, the static SCC traversal order could be any order but deterministic, let's. say B->C->A. If at runtime we see a context A->B->C->A->B->C, llvm-profgen may compress it into A->B->C. Therefore by walking the SCC in the B->C->A top-down order in the sample profile loader, we will not get B inlined into A. This change adjusts the SCC processing order to reflect what is in profile so that A->B->C will be walked in order.
  1. Honor indirect call edge order. Similar given an indirect call A->B at runtime which is missing on the static call graph, B may end up being processed before A. We'd like A to be processed before B so that B gets a chance to be inlined into A after indirect call promotion.

Thanks for the explanation. I have two questions.

  1. Since static call graph edges have been removed, many cold callsites of small functions won't be inlined anymore. Those will all be left to CGSCC inlining. I understand that inlining those small functions may not help reduce caller size and boost more inlining because there is no cleanup after inlining in place in sample loader pass. But that is a big difference from current early inlining model. Do you see any impact from it?
  1. The two benefits above of dynamic call graph also applies to non-CSSPGO profile. Is it possible to make it optional for non-CSSPGO profile so we can try it?

For #1, this change only adds edges to the static call graph to enforce additional order. It doesn't remove existing call edges. So it shouldn't block previous profile-based inlining. Inlining for cold callsites that exist in profile will be honored with this change. Inlining for callistes that are not recorded in the profile will mostly be done by CGSCC.

For #2, it's a good point. I believe the two benefits should also help non-CS profile based inlining. Though I haven't tried that, IIUC, the counts returned for non-inlined callees should have a better quality. That said, I'm not sure the non-CS profiled inlining can benefit as much as CSSPGO does since I haven't seen the issues that motivated this diff exists with non-CSSPGO. The main reason is that the CSSPGO inliner works on a context tri that requires explicit tri edge to present when processing a call edge. This isn't required with a non-CS profile where the concept of context tri edges are nested in the current function's profile.

wmi added a comment.Feb 8 2021, 3:15 PM
In D95988#2549755, @hoy wrote:
In D95988#2549605, @wmi wrote:
In D95988#2546946, @hoy wrote:
In D95988#2546697, @wmi wrote:

What is the main difference between dynamic call graph based inlining vs static call graph + priority based inlining (https://reviews.llvm.org/D94001)?

The work in this change is an enhancement to the priority-based inliner in that:

  1. Honor profile SCC traversal order for more inlining. E.g, where there is a circle in the static call graph say A->B->C->A, the static SCC traversal order could be any order but deterministic, let's. say B->C->A. If at runtime we see a context A->B->C->A->B->C, llvm-profgen may compress it into A->B->C. Therefore by walking the SCC in the B->C->A top-down order in the sample profile loader, we will not get B inlined into A. This change adjusts the SCC processing order to reflect what is in profile so that A->B->C will be walked in order.
  1. Honor indirect call edge order. Similar given an indirect call A->B at runtime which is missing on the static call graph, B may end up being processed before A. We'd like A to be processed before B so that B gets a chance to be inlined into A after indirect call promotion.

Thanks for the explanation. I have two questions.

  1. Since static call graph edges have been removed, many cold callsites of small functions won't be inlined anymore. Those will all be left to CGSCC inlining. I understand that inlining those small functions may not help reduce caller size and boost more inlining because there is no cleanup after inlining in place in sample loader pass. But that is a big difference from current early inlining model. Do you see any impact from it?
  1. The two benefits above of dynamic call graph also applies to non-CSSPGO profile. Is it possible to make it optional for non-CSSPGO profile so we can try it?

For #1, this change only adds edges to the static call graph to enforce additional order. It doesn't remove existing call edges. So it shouldn't block previous profile-based inlining. Inlining for cold callsites that exist in profile will be honored with this change. Inlining for callistes that are not recorded in the profile will mostly be done by CGSCC.

I may misunderstand but I was refering to replaceCallGraphEdges. The command inside says:

// Remove static call edges from the call graph except for the ones from the
// root which make the call graph connected.

For #2, it's a good point. I believe the two benefits should also help non-CS profile based inlining. Though I haven't tried that, IIUC, the counts returned for non-inlined callees should have a better quality. That said, I'm not sure the non-CS profiled inlining can benefit as much as CSSPGO does since I haven't seen the issues that motivated this diff exists with non-CSSPGO. The main reason is that the CSSPGO inliner works on a context tri that requires explicit tri edge to present when processing a call edge. This isn't required with a non-CS profile where the concept of context tri edges are nested in the current function's profile.

Yes, I agree CSSPGO profile may be benefited more from this change than non-CSSPGO profile. It looks like a good enhancement for both. Do you think it is doable to make the change oblivious to the type of CS/non-CS profile?

hoy added a comment.Feb 8 2021, 3:45 PM
In D95988#2549816, @wmi wrote:
In D95988#2549755, @hoy wrote:
In D95988#2549605, @wmi wrote:
In D95988#2546946, @hoy wrote:
In D95988#2546697, @wmi wrote:

What is the main difference between dynamic call graph based inlining vs static call graph + priority based inlining (https://reviews.llvm.org/D94001)?

The work in this change is an enhancement to the priority-based inliner in that:

  1. Honor profile SCC traversal order for more inlining. E.g, where there is a circle in the static call graph say A->B->C->A, the static SCC traversal order could be any order but deterministic, let's. say B->C->A. If at runtime we see a context A->B->C->A->B->C, llvm-profgen may compress it into A->B->C. Therefore by walking the SCC in the B->C->A top-down order in the sample profile loader, we will not get B inlined into A. This change adjusts the SCC processing order to reflect what is in profile so that A->B->C will be walked in order.
  1. Honor indirect call edge order. Similar given an indirect call A->B at runtime which is missing on the static call graph, B may end up being processed before A. We'd like A to be processed before B so that B gets a chance to be inlined into A after indirect call promotion.

Thanks for the explanation. I have two questions.

  1. Since static call graph edges have been removed, many cold callsites of small functions won't be inlined anymore. Those will all be left to CGSCC inlining. I understand that inlining those small functions may not help reduce caller size and boost more inlining because there is no cleanup after inlining in place in sample loader pass. But that is a big difference from current early inlining model. Do you see any impact from it?
  1. The two benefits above of dynamic call graph also applies to non-CSSPGO profile. Is it possible to make it optional for non-CSSPGO profile so we can try it?

For #1, this change only adds edges to the static call graph to enforce additional order. It doesn't remove existing call edges. So it shouldn't block previous profile-based inlining. Inlining for cold callsites that exist in profile will be honored with this change. Inlining for callistes that are not recorded in the profile will mostly be done by CGSCC.

I may misunderstand but I was refering to replaceCallGraphEdges. The command inside says:

// Remove static call edges from the call graph except for the ones from the
// root which make the call graph connected.

For #2, it's a good point. I believe the two benefits should also help non-CS profile based inlining. Though I haven't tried that, IIUC, the counts returned for non-inlined callees should have a better quality. That said, I'm not sure the non-CS profiled inlining can benefit as much as CSSPGO does since I haven't seen the issues that motivated this diff exists with non-CSSPGO. The main reason is that the CSSPGO inliner works on a context tri that requires explicit tri edge to present when processing a call edge. This isn't required with a non-CS profile where the concept of context tri edges are nested in the current function's profile.

Yes, I agree CSSPGO profile may be benefited more from this change than non-CSSPGO profile. It looks like a good enhancement for both. Do you think it is doable to make the change oblivious to the type of CS/non-CS profile?

Sure, I think it's doable to add profile call edges for non-CS profile. The profile context extracting code will look differently from the current implementation.

wmi added a comment.Feb 8 2021, 3:54 PM
In D95988#2549879, @hoy wrote:
In D95988#2549816, @wmi wrote:
In D95988#2549755, @hoy wrote:
In D95988#2549605, @wmi wrote:
In D95988#2546946, @hoy wrote:
In D95988#2546697, @wmi wrote:

What is the main difference between dynamic call graph based inlining vs static call graph + priority based inlining (https://reviews.llvm.org/D94001)?

The work in this change is an enhancement to the priority-based inliner in that:

  1. Honor profile SCC traversal order for more inlining. E.g, where there is a circle in the static call graph say A->B->C->A, the static SCC traversal order could be any order but deterministic, let's. say B->C->A. If at runtime we see a context A->B->C->A->B->C, llvm-profgen may compress it into A->B->C. Therefore by walking the SCC in the B->C->A top-down order in the sample profile loader, we will not get B inlined into A. This change adjusts the SCC processing order to reflect what is in profile so that A->B->C will be walked in order.
  1. Honor indirect call edge order. Similar given an indirect call A->B at runtime which is missing on the static call graph, B may end up being processed before A. We'd like A to be processed before B so that B gets a chance to be inlined into A after indirect call promotion.

Thanks for the explanation. I have two questions.

  1. Since static call graph edges have been removed, many cold callsites of small functions won't be inlined anymore. Those will all be left to CGSCC inlining. I understand that inlining those small functions may not help reduce caller size and boost more inlining because there is no cleanup after inlining in place in sample loader pass. But that is a big difference from current early inlining model. Do you see any impact from it?
  1. The two benefits above of dynamic call graph also applies to non-CSSPGO profile. Is it possible to make it optional for non-CSSPGO profile so we can try it?

For #1, this change only adds edges to the static call graph to enforce additional order. It doesn't remove existing call edges. So it shouldn't block previous profile-based inlining. Inlining for cold callsites that exist in profile will be honored with this change. Inlining for callistes that are not recorded in the profile will mostly be done by CGSCC.

I may misunderstand but I was refering to replaceCallGraphEdges. The command inside says:

// Remove static call edges from the call graph except for the ones from the
// root which make the call graph connected.

Now I look more closely, I see that replaceCallGraphEdges is only used on a temporary callgraph which is separated from the static call graph, so you are right.

For #2, it's a good point. I believe the two benefits should also help non-CS profile based inlining. Though I haven't tried that, IIUC, the counts returned for non-inlined callees should have a better quality. That said, I'm not sure the non-CS profiled inlining can benefit as much as CSSPGO does since I haven't seen the issues that motivated this diff exists with non-CSSPGO. The main reason is that the CSSPGO inliner works on a context tri that requires explicit tri edge to present when processing a call edge. This isn't required with a non-CS profile where the concept of context tri edges are nested in the current function's profile.

Yes, I agree CSSPGO profile may be benefited more from this change than non-CSSPGO profile. It looks like a good enhancement for both. Do you think it is doable to make the change oblivious to the type of CS/non-CS profile?

Sure, I think it's doable to add profile call edges for non-CS profile. The profile context extracting code will look differently from the current implementation.

Thanks!

hoy updated this revision to Diff 322597.Feb 9 2021, 10:40 PM

Extending profile-based top-down order support to non-CS profile. Only adding support for SCC. Indirect call edges are not needed since uninlined counts are not returned to indirect call targets with non-CS profiles.

wmi added a comment.Feb 11 2021, 9:52 AM

Thanks for adding the support for non-CS profile!

Extending profile-based top-down order support to non-CS profile. Only adding support for SCC. Indirect call edges are not needed since uninlined counts are not returned to indirect call targets with non-CS profiles.

Indirect call edges are still helpful for non-CS profiles. That is because top-down inlining will be helpful for better non-CS profile matching fundamentally (unless annotated profile can be updated repeatedly, but that is not the case for branch probablity for non-CS profile). Using non top-down order, function may be annotated with outline instance's profile before it can be inlined and get more precise profile with context. Because there are no indirect call edges in the static call graph, it will be helpful to add them based on dynamic call graph, to enforce the top-down order inlining more thoroughly.

llvm/lib/Transforms/IPO/SampleProfile.cpp
2363–2369

We may not need this block. Top down inlining for non-CS profile is to get more precise profile matching, not to enable more inlining. If there is not inline instance profile for a callsite, early inlining in sample loader won't inline it so it doesn't need to be added into the dynamic call graph.

hoy added a comment.Feb 11 2021, 10:18 AM
In D95988#2557460, @wmi wrote:

Thanks for adding the support for non-CS profile!

Extending profile-based top-down order support to non-CS profile. Only adding support for SCC. Indirect call edges are not needed since uninlined counts are not returned to indirect call targets with non-CS profiles.

Indirect call edges are still helpful for non-CS profiles. That is because top-down inlining will be helpful for better non-CS profile matching fundamentally (unless annotated profile can be updated repeatedly, but that is not the case for branch probablity for non-CS profile). Using non top-down order, function may be annotated with outline instance's profile before it can be inlined and get more precise profile with context. Because there are no indirect call edges in the static call graph, it will be helpful to add them based on dynamic call graph, to enforce the top-down order inlining more thoroughly.

Yeah, it is fundamentally useful but I'm not sure there's way to justify the benefit right now. If you look at code below, nested callee profile is never returned to the outlined instance for unsuccessful inlining of indirect calls. I was thinking about a separate change to enable that as well as top-down order for indirect calls. What do you think?

https://github.com/llvm/llvm-project/blob/f8772da8cc9a0be65c9ba028c2b5a895c1ed4f91/llvm/lib/Transforms/IPO/SampleProfile.cpp#L1346

llvm/lib/Transforms/IPO/SampleProfile.cpp
2363–2369

Agreed, this is not needed since it does not add any benefit to profile counts returning.

wmi added a comment.Feb 11 2021, 10:48 AM
In D95988#2557503, @hoy wrote:
In D95988#2557460, @wmi wrote:

Thanks for adding the support for non-CS profile!

Extending profile-based top-down order support to non-CS profile. Only adding support for SCC. Indirect call edges are not needed since uninlined counts are not returned to indirect call targets with non-CS profiles.

Indirect call edges are still helpful for non-CS profiles. That is because top-down inlining will be helpful for better non-CS profile matching fundamentally (unless annotated profile can be updated repeatedly, but that is not the case for branch probablity for non-CS profile). Using non top-down order, function may be annotated with outline instance's profile before it can be inlined and get more precise profile with context. Because there are no indirect call edges in the static call graph, it will be helpful to add them based on dynamic call graph, to enforce the top-down order inlining more thoroughly.

Yeah, it is fundamentally useful but I'm not sure there's way to justify the benefit right now. If you look at code below, nested callee profile is never returned to the outlined instance for unsuccessful inlining of indirect calls. I was thinking about a separate change to enable that as well as top-down order for indirect calls. What do you think?

I understand profile count returning is a benefit for top-down inlining, but profile count returning are all related with cold profiles, so it may not be the major factor here? I think letting an inlined function annotated with inline instance profile with context instead of outline instance profile without context is the major reason that top-down inlining brings benefit, at least for non-CS profile. This is the original patch from Wenlei to add top-down inlining support: https://reviews.llvm.org/D70655

Sure, it is ok to address it in a separate change. Better add a TODO in the comment.

https://github.com/llvm/llvm-project/blob/f8772da8cc9a0be65c9ba028c2b5a895c1ed4f91/llvm/lib/Transforms/IPO/SampleProfile.cpp#L1346

hoy added a comment.Feb 11 2021, 11:09 AM
In D95988#2557591, @wmi wrote:
In D95988#2557503, @hoy wrote:
In D95988#2557460, @wmi wrote:

Thanks for adding the support for non-CS profile!

Extending profile-based top-down order support to non-CS profile. Only adding support for SCC. Indirect call edges are not needed since uninlined counts are not returned to indirect call targets with non-CS profiles.

Indirect call edges are still helpful for non-CS profiles. That is because top-down inlining will be helpful for better non-CS profile matching fundamentally (unless annotated profile can be updated repeatedly, but that is not the case for branch probablity for non-CS profile). Using non top-down order, function may be annotated with outline instance's profile before it can be inlined and get more precise profile with context. Because there are no indirect call edges in the static call graph, it will be helpful to add them based on dynamic call graph, to enforce the top-down order inlining more thoroughly.

Yeah, it is fundamentally useful but I'm not sure there's way to justify the benefit right now. If you look at code below, nested callee profile is never returned to the outlined instance for unsuccessful inlining of indirect calls. I was thinking about a separate change to enable that as well as top-down order for indirect calls. What do you think?

I understand profile count returning is a benefit for top-down inlining, but profile count returning are all related with cold profiles, so it may not be the major factor here? I think letting an inlined function annotated with inline instance profile with context instead of outline instance profile without context is the major reason that top-down inlining brings benefit, at least for non-CS profile. This is the original patch from Wenlei to add top-down inlining support: https://reviews.llvm.org/D70655

Sure, it is ok to address it in a separate change. Better add a TODO in the comment.

https://github.com/llvm/llvm-project/blob/f8772da8cc9a0be65c9ba028c2b5a895c1ed4f91/llvm/lib/Transforms/IPO/SampleProfile.cpp#L1346

I see. Thanks for the explanation. Yes, the annotation quality for inlined instances are more important for outlined instances. TODO added for indirect calls.

hoy updated this revision to Diff 323085.Feb 11 2021, 11:10 AM

Addressing Wei's feedbacks.

wmi added inline comments.Feb 11 2021, 11:39 AM
llvm/lib/Transforms/IPO/SampleProfile.cpp
2491

Many nodes in the static graph may not exist in ProfileOrderMap so they all get the same 0 value from the map. Better use llvm::stable_sort.

hoy added inline comments.Feb 11 2021, 11:53 AM
llvm/lib/Transforms/IPO/SampleProfile.cpp
2491

Good catch! Stable sort should make the change of existing static order minimized.

hoy updated this revision to Diff 323108.Feb 11 2021, 11:54 AM

Addressing Wei's comment.

wmi accepted this revision.Feb 11 2021, 12:07 PM

LGTM.

This revision is now accepted and ready to land.Feb 11 2021, 12:07 PM
This revision was landed with ongoing or failed builds.Feb 11 2021, 12:40 PM
This revision was automatically updated to reflect the committed changes.