This is an archive of the discontinued LLVM Phabricator instance.

[CSSPGO][llvm-profgen] Trim and merge context beforehand to reduce memory usage
ClosedPublic

Authored by wlei on Aug 9 2021, 9:18 PM.

Details

Summary

Currently we use a centralized string map (StringMap<FunctionSamples> ProfileMap) to store the profile while populating the samples, which can become a memory bottleneck. In an extreme case I saw thousands of samples whose context stack depth is >= 100, and memory consumption can exceed 100GB.

Since the context here is used for inlining, we can assume we won't have that many inlinees kept inlined at the same root function, so this change caps the context stack depth and merges the samples whose capped contexts collide, reducing peak memory. This is done after recursion compression.

The default value is -1, meaning no depth limit; in the future we can tune it to a smaller value.

Diff Detail

Event Timeline

wlei created this revision.Aug 9 2021, 9:18 PM
wlei requested review of this revision.Aug 9 2021, 9:18 PM
Herald added a project: Restricted Project. · View Herald TranscriptAug 9 2021, 9:18 PM
wlei updated this revision to Diff 365624.Aug 10 2021, 4:09 PM

align with line-number-based context

wlei retitled this revision from [CSSPGO][llvm-profgen] Cut off probe stack from the bottom to reduce memory usage to [CSSPGO][llvm-profgen] Cap context stack to reduce memory usage.Aug 10 2021, 4:35 PM
wlei edited the summary of this revision. (Show Details)
wlei added reviewers: hoy, wenlei, wmi.
wlei added a subscriber: spupyrev.
wenlei added inline comments.Aug 10 2021, 5:05 PM
llvm/tools/llvm-profgen/ProfileGenerator.cpp
53

I think we could unify the switch names, e.g. csprof-max-context-depth and csprof-max-cold-context-depth?

619

Since getExpandedContextStr only covers the line-based profile, for probes we rely on the trimming here in profile generation, which is later than where we do the trimming for the line-based profile. Do we see a peak memory drop if we trim the context in profile generation instead of during unwinding?

llvm/tools/llvm-profgen/ProfileGenerator.h
73

nit: bottom-up order in a stack is usually caller-to-callee order; "from bottom" can be confusing, as it suggests we trim callees, which is not the case.

Also suggest renaming capContextStack to trimContext.

hoy added inline comments.Aug 10 2021, 5:06 PM
llvm/tools/llvm-profgen/PerfReader.cpp
23

Nit: move this into ProfileGenerator.h to reduce the number of declarations?

llvm/tools/llvm-profgen/ProfileGenerator.cpp
53

Thanks for working on this. We probably do not inline that many levels of functions, but it would be good to run some perf testing or to turn this off by default.

wlei updated this revision to Diff 365653.Aug 10 2021, 8:46 PM

addressing Wenlei and Hongtao's feedback

wlei added inline comments.Aug 10 2021, 8:51 PM
llvm/tools/llvm-profgen/PerfReader.cpp
23

fixed!

llvm/tools/llvm-profgen/ProfileGenerator.cpp
53

Sounds good. I will collect statistics on the max inline depth in the SampleProfile inliner on some benchmarks and switch to that value; maybe 10 is good enough.

619

Here it is done for both: once during unwinding (see the one in PerfReader.cpp) and once here.

The answer is yes, it's better than unwinder-only. Here is some data:

depth 10, both: 17GB
depth 10, unwinder only: 26GB
depth 20, both: 42GB
depth 20, unwinder only: 49GB

llvm/tools/llvm-profgen/ProfileGenerator.h
73

Fixed!

wlei updated this revision to Diff 365780.Aug 11 2021, 9:40 AM

fix lint

wlei added inline comments.Aug 11 2021, 1:05 PM
llvm/tools/llvm-profgen/ProfileGenerator.cpp
53

Here is the max inline depth (if a inlines b and b inlines c, the depth is 2) in SPEC2017 monoLTO pass2 (all inliners turned on):

508.namd_r  5
510.parest_r 21
511.povray_r  8
526.blender_r 15
600.perlbench_s 8
602.gcc_s 21 
605.mcf_s 5
620.omnetpp_s 18
623.xalancbmk_s 26
625.x264_s 7
631.deepsjeng_s 5
638.imagick_s 10
641.leela_s 16
644.nab_s 5
657.xz_s 7

And for the clang-10 pass1 binary (I don't have a pass2 binary), the max inline depth is 51!

There is really more inlining than I thought, so I agree with turning it off (-1) by default.

wlei updated this revision to Diff 365832.Aug 11 2021, 1:09 PM

default the value to -1

wenlei added inline comments.Aug 11 2021, 1:33 PM
llvm/tools/llvm-profgen/ProfileGenerator.cpp
53

Sounds reasonable. If we run into such situations more often, we could also try another level of aggregation keyed by the leaf frame of the stack samples; then we can tell that some contexts are cold before unwinding, and dynamically trim those cold contexts during unwinding.

Can we make the description and variable name consistent with CSProfColdContextFrameDepth too?

wenlei accepted this revision.Aug 11 2021, 1:47 PM
wenlei added inline comments.
llvm/tools/llvm-profgen/ProfileGenerator.cpp
53

To be specific: CSProfMaxContextDepth, CSProfMaxColdContextDepth. "Keep the last K frames while merging [cold] profile ..." Otherwise the change looks good.

This revision is now accepted and ready to land.Aug 11 2021, 1:47 PM
spupyrev added inline comments.Aug 11 2021, 1:54 PM
llvm/tools/llvm-profgen/ProfileGenerator.cpp
74

The default value of "no trimming" sounds strange to me. Do we know whether trimming may affect performance? If yes, we shouldn't trim. If no, we should have a reasonable bound here.

wlei updated this revision to Diff 365846.Aug 11 2021, 2:02 PM

rewrite the switch's variable and description

wlei added inline comments.Aug 11 2021, 3:03 PM
llvm/tools/llvm-profgen/ProfileGenerator.cpp
74

Do we know if trimming may affect performance?

It depends on how the compiler consumes the context. If the profile context is [a @ b @ c @ d] but the compiler only inlines d into c and stops inlining at c, then the [a @ b @] prefix can be trimmed. Another consideration is the profile size tradeoff: before, we merged and trimmed cold profiles after all samples were generated; this patch moves the merging ahead because we ran into memory issues. It doesn't drop any samples.

If no, we should have a reasonable bound here.

Good question, but I don't know what a reasonable bound is yet. As the comments above show, the SPEC benchmarks' max inline depth varies from 5 to 26, and the clang-10 binary even has a max depth of 51. We didn't hit memory issues for SPEC, so SPEC can use -1.

So it's a tradeoff between many things. What I have in mind is to default it to -1 and tune it on demand. What do you think?

hoy accepted this revision.Aug 11 2021, 3:08 PM
hoy added inline comments.
llvm/tools/llvm-profgen/ProfileGenerator.cpp
74

Default -1 (meaning no trimming) sounds good to me. An otherwise decent default value would need quite some experimentation to justify.

wlei retitled this revision from [CSSPGO][llvm-profgen] Cap context stack to reduce memory usage to [CSSPGO][llvm-profgen] Trim and merge context beforehand to reduce memory usage.Aug 11 2021, 3:12 PM
wlei edited the summary of this revision. (Show Details)
spupyrev added inline comments.Aug 11 2021, 3:13 PM
llvm/tools/llvm-profgen/ProfileGenerator.cpp
74

I dislike tunable flags that may or may not affect performance. I assume most (all?) end-users won't know about the flag and will use the default value, so whatever you set here will be used in the vast majority of cases.

Of course, if you do plan to run some perf experiments in the (near) future and tune the default value, then the current value of -1 is OK.

wlei added inline comments.Aug 11 2021, 3:32 PM
llvm/tools/llvm-profgen/ProfileGenerator.cpp
74

I dislike tunable flags that may or may not affect performance. I assume most (all?) end-users won't know about the flag and will use the default value, so whatever you set here will be used in the vast majority of cases.

That makes sense, thanks! Will tune it; maybe the clang binary is a good place, as it has deep call stacks.

This revision was landed with ongoing or failed builds.Aug 11 2021, 4:02 PM
This revision was automatically updated to reflect the committed changes.