This is an archive of the discontinued LLVM Phabricator instance.

[ProfSampleLoader] When disable-sample-loader-inlining is true, merge profiles of inlined instances into their outlined versions.
ClosedPublic

Authored by mingmingl on Mar 16 2022, 2:50 PM.

Details

Summary

When --disable-sample-loader-inlining is true, skip the inline transformation, but merge profiles of inlined instances into their outlined versions.
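For illustration, here is a minimal sketch of the behavior being described (not the actual diff; the helper and variable names are hypothetical, though llvm::sampleprof::FunctionSamples::merge is the real merging primitive):

#include "llvm/ProfileData/SampleProf.h"
using llvm::sampleprof::FunctionSamples;

// Hypothetical helper: when sample loader inlining is disabled, skip the
// inline transformation but fold the inlined instance's samples back into
// the profile of the outlined copy, so downstream passes see counts that
// are consistent with the call never having been inlined.
static void mergeInlineeIntoOutlined(FunctionSamples &OutlinedProfile,
                                     const FunctionSamples &InlineeProfile) {
  OutlinedProfile.merge(InlineeProfile);
}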

Diff Detail

Event Timeline

mingmingl created this revision.Mar 16 2022, 2:50 PM
Herald added a project: Restricted Project. Mar 16 2022, 2:50 PM
mingmingl published this revision for review.Mar 16 2022, 2:50 PM
mingmingl added reviewers: xur, kazu.
mingmingl added a subscriber: davidxl.
Herald added a project: Restricted Project. Mar 16 2022, 2:51 PM
mingmingl updated this revision to Diff 415995.Mar 16 2022, 3:00 PM

fix typo "inlinee" in comment

mingmingl updated this revision to Diff 415999.Mar 16 2022, 3:15 PM

polish comment around option --sample-profile-merge-inlinee to minimize the potential ambiguity.

For inlineHotFunctionsWithPriority, do not add functions to LocalNotInlinedCallSites if ContextTracker is valid (not nullptr).

This keeps the original semantics: LocalNotInlinedCallSites stays empty when ContextTracker is not nullptr.
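Roughly, the guard looks like this (a sketch using the approximate names from the discussion, not the exact diff):

// Inside inlineHotFunctionsWithPriority: remember a not-inlined call site
// for later profile merging only when CSSPGO context tracking is unused.
if (!ContextTracker)
  LocalNotInlinedCallSites.insert({CB, FS});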

wenlei added a subscriber: hoy.Mar 17 2022, 9:15 AM

What's your actual use case that motivated this change?

When this was originally added, we used it to disable only pre-link sample loader inlining, in which case we don't want to merge the profiles even when inlining there is skipped. This is because during post-link, sample loader inlining will still happen, so merging profiles during pre-link as if the relevant inlining will never happen may cause over-optimization (it was measurable in code size). +@hoy

What's your actual use case that motivated this change?

When this was originally added, we used it to disable only pre-link sample loader inlining, in which case we don't want to merge the profiles even when inlining there is skipped. This is because during post-link, sample loader inlining will still happen, so merging profiles during pre-link as if the relevant inlining will never happen may cause over-optimization (it was measurable in code size). +@hoy

Good point about the post-link sample loader inlining. The intention of this internal flag is actually to disable both pre-link and post-link sample loader inlining for performance experiment purposes (e.g., exercising more cost/benefit analysis in the regular inliner) -- a related effort is kazu@'s analysis of context similarity, to try to tune down the sample loader inlining.

a related effort is kazu@'s analysis of context similarity, to try to tune down the sample loader inlining.

Curious to hear more about this. Is tuning down sample loader inlining for performance or something else (e.g., compile time)? The SCC inliner being bottom-up makes it less selective, which is a disadvantage compared to early top-down inlining, so we were actually hoping to move more inlining from SCC to the sample loader. Or are you shifting inlining to the new ModuleInliner?

a related effort is kazu@'s analysis of context similarity, to try to tune down the sample loader inlining.

Curious to hear more about this. Is tuning down sample loader inlining for performance or something else (e.g., compile time)? The SCC inliner being bottom-up makes it less selective, which is a disadvantage compared to early top-down inlining, so we were actually hoping to move more inlining from SCC to the sample loader. Or are you shifting inlining to the new ModuleInliner?

Re SCC inlining, yes -- kazu's prologue/epilogue-based cost analysis won't work well with it either -- so eventually more sophisticated analyses will rely on the module inliner. The working-set-based analysis I mentioned a while back is another candidate.

hoy added a comment.Mar 17 2022, 10:12 AM

What's your actual use case that motivated this change?

When this was originally added, we used it to disable only pre-link sample loader inlining, in which case we don't want to merge the profiles even when inlining there is skipped. This is because during post-link, sample loader inlining will still happen, so merging profiles during pre-link as if the relevant inlining will never happen may cause over-optimization (it was measurable in code size). +@hoy

Good point about the post-link sample loader inlining. The intention of this internal flag is actually to disable both pre-link and post-link sample loader inlining for performance experiment purposes (e.g., exercising more cost/benefit analysis in the regular inliner) -- a related effort is kazu@'s analysis of context similarity, to try to tune down the sample loader inlining.

Can you shed more light on the context similarity analysis? Thanks.

so eventually more sophisticated analyses will rely on the module inliner.

Do you have plans to eventually merge the module inliner and the sample loader inliner? How would they be positioned when a sample profile is available?

What's your actual use case that motivated this change?

When this was originally added, we used it to disable only pre-link sample loader inlining, in which case we don't want to merge the profiles even when inlining there is skipped. This is because during post-link, sample loader inlining will still happen, so merging profiles during pre-link as if the relevant inlining will never happen may cause over-optimization (it was measurable in code size). +@hoy

Good point about the post-link sample loader inlining. The intention of this internal flag is actually to disable both pre-link and post-link sample loader inlining for performance experiment purposes (e.g., exercising more cost/benefit analysis in the regular inliner) -- a related effort is kazu@'s analysis of context similarity, to try to tune down the sample loader inlining.

Can you shed more light on the context similarity analysis? Thanks.

@kazu can help elaborate on it.

so eventually more sophisticated analyses will rely on the module inliner.

Do you have plans to eventually merge the module inliner and the sample loader inliner? How would they be positioned when a sample profile is available?

The sample loader inliner will still be needed, but tuned down to where it is needed for profile quality. We hope to shift more decisions to the later inliner with more analysis.

ormris removed a subscriber: ormris.Mar 17 2022, 10:56 AM
mingmingl updated this revision to Diff 416361.Mar 17 2022, 4:49 PM

Update the comment for option --disable-sample-loader-inlining to call out that it doesn't skip profile merging (as the patch currently stands).

I'll follow up on the discussion (regarding whether this is the desired behavior) in this patch.

mingmingl added a comment.EditedMar 17 2022, 5:13 PM

What's your actual use case that motivated this change?

When this was originally added, we used it to disable only pre-link sample loader inlining, in which case we don't want to merge the profiles even when inlining there is skipped.

If the profiles of inlined instances are not merged and the instances are not inlined, I'm wondering whether the profiles will be inaccurate for other optimization passes (not only the inliner itself).

I think the answer is yes, so I decided to make this patch.

This is because during post-link, sample loader inlining will still happen, so merging profiles during pre-link as if the relevant inlining will never happen may cause over-optimization (it was measurable in code size). +@hoy

Thanks for pointing this out!

To check my understanding: how does over-optimization or a measurable code size change happen when profiles are merged and the pre-link inliner skips the IR transformation? Does it happen in the following scenario?

  • "merging without inlining" cause some functions (on the edge) to become hot functions; these hot functions are imported in post-link sample loader, and cause code size increase

To check my understanding: how does over-optimization or a measurable code size change happen when profiles are merged and the pre-link inliner skips the IR transformation? Does it happen in the following scenario?

  • "merging without inlining" cause some functions (on the edge) to become hot functions; these hot functions are imported in post-link sample loader, and cause code size increase

If we skip inlining at a particular call site during pre-link but still inline that site during post-link, merging causes pre-link passes (including the pre-link SCC inliner for the callee) to be unnecessarily aggressive based on the merged (larger) counts, because those counts don't reflect post-link inlining.
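(A made-up numeric illustration: suppose outlined foo carries 50 samples and its inlined instance inside a hot caller carries 10,000. Merging at pre-link makes foo look hot at 10,050 samples, so pre-link passes optimize it aggressively; but post-link sample loader inlining then inlines that call site anyway, so the outlined foo never runs hot and the extra pre-link aggressiveness only costs size.)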

To check my understanding: how does over-optimization or a measurable code size change happen when profiles are merged and the pre-link inliner skips the IR transformation? Does it happen in the following scenario?

  • "merging without inlining" cause some functions (on the edge) to become hot functions; these hot functions are imported in post-link sample loader, and cause code size increase

If we skip inlining at a particular call site during pre-link but still inline that site during post-link, merging causes pre-link passes (including the pre-link SCC inliner for the callee) to be unnecessarily aggressive based on the merged (larger) counts, because those counts don't reflect post-link inlining.

Since --disable-sample-loader-inlining applies to both pre-link and post-link sample loader inlining in one clang invocation, I'm assuming the (undesirably aggressive) inlining happens in the feedback-driven inliner (i.e., InlinerPass::run, pointed to by [1]).

Is my understanding correct that the current implementation [2] of --disable-sample-loader-inlining would (undesirably) inline a function in the post-link stage via the feedback-driven inliner, if the function satisfies the following conditions:

  1. The function is hot within the module,
    • so inlining is desired in the pre-link stage.
  2. The function is cold across modules,
    • so inlining is not desired in the post-link stage, but it gets inlined (which could be counterproductive) by the feedback-driven inliner.

If the scenario above is where the problem happens, I'm planning to:

  1. Call out this undesired side effect in the usage of --disable-sample-loader-inlining.
  2. Call out that the option is used to evaluate the AutoFDO inliner, and recommend using it together with hacks [3] in the feedback-driven inliner for evaluation purposes.
  3. Proceed with this change, as the current implementation could be confusing (inaccurate profiles that affect more than sample loader inlining).

May I get some feedback on the plans above? Thanks!

[1] https://github.com/llvm/llvm-project/blob/f4b794427e8037a4e952cacdfe7201e961f31a6f/llvm/lib/Transforms/IPO/Inliner.cpp#L745
[2] The current implementation is:

  1. skipping sample loader inlining transformations artificially,
  2. merging profiles of each inlined instance into its outlined version.

[3] The hack is to still use static information in calculating the "threshold" but ignore profiles (https://github.com/llvm/llvm-project/blob/main/llvm/lib/Analysis/InlineCost.cpp#L1869-L1902 is where profiles are used to calculate the threshold).
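As a rough sketch of what the hack in [3] would look like (a hypothetical edit for the experiment, not actual upstream code; Params.HotCallSiteThreshold and PSI->isHotCallSite are the real knob and query in that area of InlineCost.cpp):

// In the threshold computation: keep the static, size-based logic but skip
// the profile-driven bump, so hot sample counts no longer raise the threshold.
bool UseProfileForThreshold = false; // hypothetical switch for the experiment
if (UseProfileForThreshold && PSI->isHotCallSite(Call, CallerBFI))
  Threshold = Params.HotCallSiteThreshold.getValue();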

mingmingl added a comment.EditedMar 21 2022, 4:27 PM
In D121862#3398140, @luna wrote:

To check my understanding: how does over-optimization or a measurable code size change happen when profiles are merged and the pre-link inliner skips the IR transformation? Does it happen in the following scenario?

  • "merging without inlining" cause some functions (on the edge) to become hot functions; these hot functions are imported in post-link sample loader, and cause code size increase

If we skip inlining at a particular call site during pre-link but still inline that site during post-link, merging causes pre-link passes (including the pre-link SCC inliner for the callee) to be unnecessarily aggressive based on the merged (larger) counts, because those counts don't reflect post-link inlining.

Since --disable-sample-loader-inlining applies to both pre-link and post-link sample loader inlining in one clang invocation, I'm assuming the (undesirably aggressive) inlining happens in the feedback-driven inliner (i.e., InlinerPass::run, pointed to by [1]).

Is my understanding correct that the current implementation [2] of --disable-sample-loader-inlining would (undesirably) inline a function in the post-link stage via the feedback-driven inliner, if the function satisfies the following conditions:

  1. The function is hot within the module,
    • so inlining is desired in the pre-link stage.
  2. The function is cold across modules,
    • so inlining is not desired in the post-link stage, but it gets inlined (which could be counterproductive) by the feedback-driven inliner.

If the scenario above is where the problem happens, I'm planning to:

  1. Call out this undesired side effect in the usage of --disable-sample-loader-inlining.
  2. Call out that the option is used to evaluate the AutoFDO inliner, and recommend using it together with hacks [3] in the feedback-driven inliner for evaluation purposes.
  3. Proceed with this change, as the current implementation could be confusing (inaccurate profiles that affect more than sample loader inlining).

May I get some feedback on the plans above? Thanks!

[1] https://github.com/llvm/llvm-project/blob/f4b794427e8037a4e952cacdfe7201e961f31a6f/llvm/lib/Transforms/IPO/Inliner.cpp#L745
[2] The current implementation is:

  1. skipping sample loader inlining transformations artificially,
  2. merging profiles of each inlined instance into its outlined version.

[3] The hack is to still use static information in calculating the "threshold" but ignore profiles (https://github.com/llvm/llvm-project/blob/main/llvm/lib/Analysis/InlineCost.cpp#L1869-L1902 is where profiles are used to calculate the threshold).

(An addendum to this comment)

With or without profile merging, artificially skipping inline transformations changes the profile and could cause unexpected behavior (even just for evaluation purposes).

To elaborate:

  1. With profile merging, functions that are cold across modules but hot within a module become hot across modules (toy numbers below).
  2. Without profile merging, the profiles are inaccurate (e.g., an outlined function could have been, and should be, considered hot if the profiles of its inlined instances were accounted for).
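Toy numbers for point 1 (made up; assume a hot threshold of 1,000 samples): an outlined function with 100 samples of its own plus inlined instances carrying 300 samples in each of four modules ends up with 1,300 samples after merging, so it looks hot everywhere even though no single copy was hot.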

When the option was introduced in https://reviews.llvm.org/D120344, I think the intention was to use it to disable both sample loader inlinings (pre- and post-link), so perhaps we can just tighten the comment for the option and document that the profiles are merged back.

If there is a need for pre-link-only disabling in the future, we can introduce a new option for it (one that does not merge profiles back).

mingmingl updated this revision to Diff 417133.Mar 21 2022, 4:55 PM

Update the option description of --disable-sample-loader-inlining. Mention the side effects of profile merging when inline transformations are skipped, and point to D121862 (this patch) for more details.

When the option was introduced in https://reviews.llvm.org/D120344, I think the intention was to use it to disable both sample loader inlinings (pre- and post-link), so perhaps we can just tighten the comment for the option and document that the profiles are merged back.

If there is a need for pre-link-only disabling in the future, we can introduce a new option for it (one that does not merge profiles back).

Done.

I modified the option description. It becomes (unavoidably) verbose now, but since the option description is a centralized place to convey the caveats, I decided to keep it this way so it's not easily missed.

mingmingl updated this revision to Diff 417137.Mar 21 2022, 4:59 PM

Minor polish of option description. No code change.

When the option was introduced in https://reviews.llvm.org/D120344, I think the intention was to use it to disable both sample loader inlinings (pre- and post-link), so perhaps we can just tighten the comment for the option and document that the profiles are merged back.

If there is a need for pre-link-only disabling in the future, we can introduce a new option for it (one that does not merge profiles back).

Sounds good.

When this was originally added, we used it to disable only pre-link sample loader inlining

Just realized this was your change. My comment above was referring to an internal change (and I didn't realize it was internal).

llvm/lib/Transforms/IPO/SampleProfile.cpp
1202–1203

Move the check into tryPromoteAndInlineCandidate/tryInlineCandidate, so we don't need to spray it everywhere?

mingmingl updated this revision to Diff 417326.Mar 22 2022, 9:50 AM
  1. Move DisableSampleLoaderInlining into {tryPromoteAndInlineCandidate,tryInlineCandidate} as suggested.
  2. Update the option description of --disable-sample-loader-inlining.
mingmingl updated this revision to Diff 417329.Mar 22 2022, 9:55 AM

Only option description is changed. No code change.

Auto-formatting gave short lines (by splitting one long line into two). Manually moved the next line up for cosmetic reasons.

wenlei accepted this revision.Mar 22 2022, 4:29 PM

lgtm with some nits.

llvm/lib/Transforms/IPO/SampleProfile.cpp
183–184

This seems too verbose - it's the longest message among all flags here. But I don't have a strong opinion.

191

remove

1495–1496

Remove this comment as well.

1510

nit: remove this white space line and avoid unrelated changes.

This revision is now accepted and ready to land.Mar 22 2022, 4:29 PM
mingmingl updated this revision to Diff 417673.Mar 23 2022, 9:55 AM
mingmingl marked 4 inline comments as done.

Address comments (diffs are described in comment reply).

thanks for the reviews!

llvm/lib/Transforms/IPO/SampleProfile.cpp
183–184

I agree.

Took the liberty to move the side-effects part into a "//" comment.

--sample-profile-remapping-file (in the same cpp file) has a comment [1] alongside its option description, so hopefully this is fine for code style.

[1] https://github.com/llvm/llvm-project/blob/main/llvm/lib/Transforms/IPO/SampleProfile.cpp#L123-L125

191

thanks for the catch! Done.

mingmingl updated this revision to Diff 417674.Mar 23 2022, 9:56 AM

Remove quote in code comment.

wenlei added inline comments.Mar 23 2022, 10:00 AM
llvm/lib/Transforms/IPO/SampleProfile.cpp
1495

I'd avoid random whitespace changes in general, unless you're intentionally fixing some format issues.

mingmingl marked 2 inline comments as done.

remove blank lines around untouched lines as suggested.

thanks for the reviews!

llvm/lib/Transforms/IPO/SampleProfile.cpp
1495

Removed the added blank lines here and in three other places.

unless you're intentionally fixing some format issues.

(This is opinion-based, so just to explain why I added so many blank lines.)

Blank lines were added where I didn't see curly braces around a one-line if-else, even though going braceless is the code style (https://llvm.org/docs/CodingStandards.html#don-t-use-braces-on-simple-single-statement-bodies-of-if-else-loop-statements).

I'll avoid adding blank lines around untouched lines from now on.

This revision was landed with ongoing or failed builds.Mar 23 2022, 1:07 PM
This revision was automatically updated to reflect the committed changes.
kazu added a comment.Jun 6 2022, 3:27 PM

@hoy, @wenlei Sorry for the extremely delayed reply to your questions. Here are some thoughts between @davidxl and myself:

Quick background

I enabled the cost-benefit analysis in early 2021 for instrumentation FDO. It gives us performance gains by inlining big but very hot call sites that would be rejected by the simple size-based threshold. At the same time, we keep the combined size of .text.hot and .text largely the same by rejecting small but marginally hot call sites. In the end, we reduce the call instruction frequency -- the number of call instructions executed per 1000 retired instructions.

Analyzing our large internal benchmark reveals several problems of AutoFDO relative to instrumentation FDO:

  • The AutoFDO binary performs worse.
  • The combined size of .text.hot and .text of the AutoFDO executable is 8 times as big as that of the FDO executable (even with the machine function splitting turned off).
  • The AutoFDO binary is more frontend bound and causes more i-cache and iTLB misses on x86. That is, the backend is sitting idle, waiting for decoded instructions.
  • The AutoFDO binary invokes call instructions more often than the FDO binary (even with the cost-benefit analysis disabled).

So, all in all, it's clear that we are inlining a less-than-ideal set of call sites.

Context similarity analysis

I'm exploring the opposite of what you are exploring -- shifting some inlining from the sample loader inliner to the SCC inliner with the cost-benefit analysis enabled (it is currently disabled for AutoFDO).

In my experiments, I don't get consistent performance wins from simply tuning down the sample loader inliner with increased thresholds on sample counts and enabling the cost-benefit analysis for the SCC inliner. So, I am wondering if we could intelligently tune down the sample loader inliner -- inlining context-sensitive callees in the sample loader inliner and leaving the rest to the SCC inliner. Note that the sample loader inliner is the only place where we can take advantage of context sensitivity. Once we flatten the profile of a given callee, we lose information on context sensitivity.

I did some analysis on context sensitivity with clang (as an application) and our large internal benchmark. It turns out that inlining a context-sensitive function into its immediate caller will allow us to take advantage of most of the context sensitivity. Specifically, in clang, only 33% of functions have a single behavior. Now, given A->B, if we hypothetically created a copy of B and named it B_called_by_A for every callee at the source code level, 91% of functions would have a single behavior. Our large internal benchmark is similar to clang in this aspect.

Module inliner

I am planning to explore the possibility of (largely) replacing the SCC inliner with the module inliner. Basically, we would inline call sites in descending order of profitability -- most likely the ratio of cycle savings to size cost. A sketch follows.
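A minimal sketch of that priority-queue loop (illustrative only, not the actual ModuleInliner implementation; score and inlineAndReturnNewCallSites are hypothetical helpers):

#include "llvm/IR/InstrTypes.h" // llvm::CallBase
#include <queue>
#include <vector>

struct Candidate {
  llvm::CallBase *CB;
  double Ratio; // estimated cycle savings divided by size cost
};
struct LowerRatio {
  bool operator()(const Candidate &A, const Candidate &B) const {
    return A.Ratio < B.Ratio; // max-heap: most profitable call site on top
  }
};

// Hypothetical helpers assumed by the sketch.
double score(llvm::CallBase *CB);
std::vector<llvm::CallBase *> inlineAndReturnNewCallSites(llvm::CallBase *CB);

void runPriorityInlining(const std::vector<llvm::CallBase *> &CallSites,
                         double ProfitabilityCutoff) {
  std::priority_queue<Candidate, std::vector<Candidate>, LowerRatio> Queue;
  for (llvm::CallBase *CB : CallSites)
    Queue.push({CB, score(CB)});
  while (!Queue.empty()) {
    Candidate C = Queue.top();
    Queue.pop();
    double Fresh = score(C.CB); // earlier inlining may have changed the caller
    if (Fresh != C.Ratio) {
      Queue.push({C.CB, Fresh}); // stale entry: re-queue with the fresh score
      continue;
    }
    // Assuming scores only decrease as inlining proceeds, the top entry is an
    // upper bound on every remaining candidate, so we can stop at the cutoff.
    if (C.Ratio < ProfitabilityCutoff)
      break;
    for (llvm::CallBase *NewCB : inlineAndReturnNewCallSites(C.CB))
      Queue.push({NewCB, score(NewCB)}); // newly exposed call sites join the queue
  }
}

Note that unprofitable call sites simply wait in the queue rather than being discarded, which is what makes the prologue/epilogue idea below workable.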

My plan is to try it out with instrumentation FDO first, as we have very accurate (but context-insensitive) profile information.

Prologue/epilogue analysis

@davidxl mentioned this, so I might as well expand on it a little here. We do spend cycles on prologues and epilogues, but we do not take that into account in inlining. Specifically, given A->B->C, inlining C into B could make B's prologue/epilogue bigger because of increased register usage. If B doesn't call C often enough, then B's bigger prologue/epilogue could slow things down when A calls B.

There is room for improvement in this area, but it's hard to capture in the SCC inliner. If we avoid inlining C into B because of prologue/epilogue size concerns, but B later gets inlined into A, then we worried about the prologue/epilogue size in vain. We need an inliner that doesn't simply discard call sites that don't currently look profitable.

I'm hoping that the module inliner fits the bill here. A call site that does not look profitable currently simply stays in the priority queue. If B->C does not look profitable enough now, we might inline A->B first, which may already make A's prologue/epilogue big enough; at that point, inlining C into A (with B inlined into it) may cause no additional harm to A's prologue/epilogue.

Sample loader inliner and module inliner

This combination is pretty far into the future in my current plan, but the core idea will probably stay the same. Let the sample loader inliner inline profitable context-sensitive call sites, leaving the rest to the module inliner (as opposed to the SCC inliner).

Thanks for sharing your insights @kazu.

We actually tried cost-benefit inlining for both IRPGO and Sample PGO. We didn't observe a measurable gain with IRPGO on the HHVM workload, but .text size was 5% smaller with the same perf.

For sample PGO, we tried annotating call site weights first and then invoking the call analyzer for cost-benefit-based inlining, all within the sample loader, but also didn't get better results.

We still believe cost-benefit analysis is the way to go, but we just haven't looked deeply enough into it to understand why it didn't bring a benefit for our workload.

For the module inliner vs. the sample loader inliner, I understand that it'd be more accurate to estimate the size and simplification effects of inlining when done later in the pipeline, but from the current analysis, it looks like we won't miss much if we do cost-benefit analysis in the sample loader. Have you considered that? This way you get to leverage context sensitivity and also benefit from cost-benefit analysis.

About context similarity, IIUC your experiments indicate that the majority of function profiles are actually not context sensitive? That is interesting. For CSSPGO on one internal workload, we see noticeable perf degradation if we try to cap context profile depth -- that indicates there's enough context sensitivity on the critical path for it to matter. Maybe this is workload-dependent.

The prologue/epilogue analysis would need to be done a bit late. It'd be interesting to see how accurately we can model them. Maybe you have seen pathological cases where it's important, but I wonder how much it helps the general case.

The combined size of .text.hot and .text of the AutoFDO executable is 8 times as big as that of the FDO executable (even with the machine function splitting turned off).

An 8x total .text difference is very surprising. For us, on the HHVM workload, AutoFDO/CSSPGO total .text size is only a few % larger than IRPGO using the same training. For your case, did you use the same training for both, or was IRPGO using benchmark/canary while AutoFDO was on a fleet-wide profile?

For sampling PGO in general, we found there's a correlation between perf improvement and size increase, even when we were tweaking selectiveness rather than aggressiveness. Similarly, we found that tuning inlining to be more aggressive than the default often leads to better perf. Wondering if you've observed something similar.

kazu added a comment.Jun 8 2022, 2:48 PM

@wenlei, thank *you* for sharing your insights.

For module inliner vs sample loader inliner, I understand that it'd be more accurate to estimate size and simplification effect of inlining when done later in the pipeline, but from the current analysis, it looks like we won't miss much if we do cost-benefit analysis in sample loader, have you considered that? This way you get to leverage context sensitivity and also benefit from cost-benefit analysis.

I haven't considered doing cost-benefit analysis in the sample loader inliner, but I have observed something consistent with your comment about accuracy. Specifically, the cycle savings are dominated by the call site cost (that is, the cost of the call and return instructions) computed in InlineCostCallAnalyzer::costBenefitAnalysis in InlineCost.cpp:

// Compute the total savings for the call site.
auto *CallerBB = CandidateCall.getParent();
BlockFrequencyInfo *CallerBFI = &(GetBFI(*(CallerBB->getParent())));
// Add the cost of the call and return instructions themselves to the savings.
CycleSavings += getCallsiteCost(this->CandidateCall, DL);
// Scale by how many times the caller's block executes per the profile.
CycleSavings *= CallerBFI->getBlockProfileCount(CallerBB).getValue();

I do compute the savings from conditionals being folded and such (look for the comment "Count a conditional branch as savings if it becomes unconditional."), but that's a relatively minor addition. For large enough applications, hot call sites are easily a factor of 100x or 1000x apart in terms of their sample counts. Even if our size estimate is off by a factor of, say, 2 because we forget to take folded branches into account, the slightly imprecise cost-benefit ratio just affects those call sites around the cost-benefit ratio threshold, without significantly impacting the set of important call sites to be inlined.
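(Made-up numbers for concreteness: a call site with 1,000,000 samples and an estimated size cost of 100 has a benefit ratio of 10,000, while one with 1,000 samples and the same size has a ratio of 10. Doubling every size estimate halves the ratios to 5,000 and 5; no decision flips unless the acceptance threshold happens to sit between a site's old and new ratio, which is only the case for borderline sites.)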

My observation above is also consistent with the following: small changes to the cost-benefit threshold again affect only those call sites around the cost-benefit ratio threshold, without significantly impacting the set of important call sites to be inlined. Requiring more benefit would produce a smaller executable with comparable performance. In concrete terms, you might set InlineSavingsMultiplier to 4 to 6 for instrumentation FDO applications.

About context similarity, IIUC you experiments indicates majority of function profiles are actually not context sensitive? That is interesting. For CSSPGO on one internal workload, we see noticeable perf degradation if we try to cap context profile depth -- that indicates there's enough context sensitivity on critical path for it to matter. Maybe this is workload dependent.

No, the majority of functions (about 67% for clang) are context sensitive. I did say that the hypothetical renaming trick (that is, one level of inlining) would take advantage of most of the context sensitivity, but that's for the entire AutoFDO application. If we focus on the top 90% of samples or something, the picture may look quite different. It's totally possible that our applications also require certain depths of inlining to reveal the context sensitivity.

The prolog/epilog analysis would need to be done a bit late. It'd be interesting to see to how accurately we can model them.

IIRC, if I do the traditional iterative live-variable analysis on the LLVM IR (not the machine IR) and take into account the x86-specific calling conventions (but treat all variables as integer/pointer variables for simplicity), I can accurately estimate the number of PUSH instructions 67% of the time, and I am off by no more than one 93% of the time. Of course, if the inline "root" keeps changing during inlining, the fairly accurate estimate is of no use. I'll revisit this project when I start playing with the module inliner, where the inline "root" will hopefully tend to stick around, and less important call sites can wait in the priority queue instead of being discarded.

8x total .text difference is very surprising. For us, on HHVM workload, AutoFDO/CSSPGO total .text size is only a few % larger than IRPGO using the same training. For your case, did you use the same training for both, or was IRPGO using benchmark/canary while AutoFDO was on fleet-wide profile?

The latter. The instrumentation FDO executable gets trained with a representative workload, while the AutoFDO executable is built with the profile collected from the instrumentation FDO executable processing the real workload. I don't know how good an idea it is to build an AutoFDO executable this way; at least, it's not a typical way to build AutoFDO executables.

For sampling PGO in general, we found there's a correlation between perf improvement and size increase, even when we were tweaking selectiveness rather than aggressiveness. Similarly, we found that tuning inlining to be more aggressive than the default often leads to better perf. Wondering if you've observed something similar.

No, I haven't played with the AutoFDO parameters in isolation. That said, I suspect (but haven't verified) that our AutoFDO executable is well past your correlation between perf improvement and size increase. Knowing that the instrumentation FDO executable performs a few percent better than the AutoFDO counterpart, along with other goodies (smaller .text/.text.hot footprint, reduced i-cache/iTLB misses, reduced call instruction frequency, etc.), I'm hoping to put the AutoFDO executable on another slope to a higher maximum.

hoy added a comment.Jun 8 2022, 3:28 PM

Thanks for sharing more about the context and the background, @kazu

I enabled the cost-benefit analysis in early 2021 for instrumentation FDO. It gives us performance gains by inlining big but very hot call sites that would be rejected by the simple size-based threshold.

How about the build time? We've encountered dramatic build time increases due to giant functions resulting from aggressive AutoFDO inlining.

So, I am wondering if we could intelligently tune down the sample loader inliner -- inlining context-sensitive callees in the sample loader inliner and leaving the rest to the SCC inliner.

I'm trying to find the theory behind this. The two inliners behave differently due to their styles, i.e., top-down vs. bottom-up. Regardless of context sensitivity, shifting inlining from one to the other may create different inlining patterns, and I haven't found an easy way to tell which is better. Please let me know your insight on this. We are trying to make the whole inlining process perform uniformly (for now by shifting as much inlining as possible towards the top-down sample loader inliner) so we can focus on tuning an inlining model for one inliner. We found it more complicated in LTO mode, where prelink CGSCC inlining could affect post-link sample loader inlining in unexpected ways, i.e., by increasing candidate size and breaking the pre-existing calling contexts in the profile. This makes it harder to tune the heuristics for the sample loader. We tried turning off the prelink CGSCC inliner and saw some size wins but neutral perf. We will likely do more investigation in this direction.